Call for Papers: HumEval 2024 @ LREC-COLING 2024

Simone Balloccu
Wed, Jan 10, 2024 1:12 PM

The Fourth Workshop on Human Evaluation of NLP Systems (HumEval 2024)
invites the submission of long and short papers on current human evaluation
research and future directions. HumEval 2024 will take place in Turin
(Italy) on 21 May 2024, during LREC-COLING 2024.

Website: https://humeval.github.io/

Important dates:
Submission deadline: 11 March 2024
Paper acceptance notification: 4 April 2024
Camera-ready versions: 19 April 2024
HumEval 2024: 21 May 2024
LREC-COLING 2024 conference: 20–25 May 2024

All deadlines are 23:59 UTC-12.

===============================================

Human evaluation plays a central role in NLP, from the large-scale
crowd-sourced evaluations carried out e.g. by the WMT workshops, to the
much smaller experiments routinely encountered in conference papers.
Moreover, while NLP has embraced a number of automatic evaluation metrics,
the field has always been acutely aware of their limitations
(Callison-Burch et al., 2006; Reiter and Belz, 2009; Novikova et al., 2017;
Reiter, 2018; Mathur et al., 2020a), and has gauged their trustworthiness
in terms of how well, and how consistently, they correlate with human
evaluation scores (Gatt and Belz, 2008; Popović and Ney, 2011; Shimorina,
2018; Mille et al., 2019; Dušek et al., 2020; Mathur et al., 2020b).

Yet there is growing unease about how human evaluations are conducted in
NLP. Researchers have pointed out the less-than-perfect experimental and
reporting standards that prevail (van der Lee et al., 2019; Gehrmann et
al., 2023), and that low-quality evaluations with crowdworkers may not
correlate well with high-quality evaluations with domain experts (Freitag
et al., 2021). Only a small proportion of papers provide enough detail for
reproduction of human evaluations, and in many cases the information
provided is not even enough to support the conclusions drawn (Belz et al.,
2023). We have found that more than 200 different quality criteria (such
as Fluency, Accuracy, Readability, etc.) have been used in NLP, that
different papers use the same quality criterion name with different
definitions and the same definition with different names (Howcroft et al.,
2020), and that many papers do not use a named criterion at all, asking
evaluators only to assess 'how good' the output is.

Inter- and intra-annotator agreement are usually reported only as an
overall number, without analysing the reasons and causes of disagreement
or the potential to reduce it. A small number of papers have aimed to
address this from different perspectives, e.g. comparing agreement for
different evaluation methods (Belz and Kow, 2010), or analysing errors and
linguistic phenomena related to disagreement (Pavlick and Kwiatkowski,
2019; Oortwijn et al., 2021; Thomson and Reiter, 2020; Popović, 2021). The
context beyond the sentence that is needed for reliable evaluation has
also started to be investigated (e.g. Castilho et al., 2020).

The above aspects all interact in different ways with the reliability and
reproducibility of human evaluation measures. While the reproducibility of
automatically computed evaluation measures has attracted attention for a
number of years (e.g. Pineau et al., 2018; Branco et al., 2020), research
on the reproducibility of measures involving human evaluations is a more
recent addition (Cooper and Shardlow, 2020; Belz et al., 2023).

The HumEval workshops (previously at EACL 2021, ACL 2022, and RANLP 2023)
aim to create a forum for current human evaluation research and future
directions: a space for researchers working with human evaluations to
exchange ideas and begin to address the issues human evaluation in NLP
faces in many respects, including experimental design, meta-evaluation and
reproducibility. We invite papers on topics including, but not limited to,
the following, as addressed in any subfield of NLP:

  • Experimental design and methods for human evaluations
  • Reproducibility of human evaluations
  • Inter-evaluator and intra-evaluator agreement
  • Ethical considerations in human evaluation of computational systems
  • Quality assurance for human evaluation
  • Crowdsourcing for human evaluation
  • Issues in meta-evaluation of automatic metrics by correlation with human
    evaluations
  • Alternative forms of meta-evaluation and validation of human evaluations
  • Comparability of different human evaluations
  • Methods for assessing the quality and the reliability of human evaluations
  • Role of human evaluation in the context of Responsible and Accountable AI

Submissions of both long and short papers will be made directly via START,
following the submission guidelines issued by LREC-COLING 2024. For full
submission details, please refer to the workshop website.

The third ReproNLP Shared Task on Reproduction of Automatic and Human
Evaluations of NLP Systems will be part of HumEval, offering (A) an Open
Track for any reproduction studies involving human evaluation of NLP
systems; and (B) the ReproHum Track where participants will reproduce the
papers currently being reproduced by partner labs in the EPSRC ReproHum
project. A separate call will be issued for ReproNLP 2024.

--
Kind regards, Simone Balloccu.
