======================================================
HumEval Workshop on Human Evaluation of NLP Systems at EACL'21
19 April 2021
Online from Kyiv, Ukraine
https://humeval.github.io/
======================================================
CALL FOR PARTICIPATION
Early registration ends: 7 April 2021
Invited Speakers: Margaret Mitchell and Lucia Specia
Open-mic session: For information on how to participate in this open
discussion session, please see the programme below.
Programme https://humeval.github.io/programme/
09:00–09:10
Opening
Chair: Anya Belz
09:10–10:00
Invited Talk: Disagreement in Human Evaluation: Blame the Task not the
Annotators
by Lucia Specia https://www.imperial.ac.uk/people/l.specia, Imperial
College London and University of Sheffield
It is well known that human evaluators are prone to disagreement and that
this is a problem for reliability and reproducibility of evaluation
experiments. The reasons for disagreement can fall into two broad
categories: (1) human evaluator, including under-trained,
under-incentivised, lacking expertise, or ill-intended individuals, e.g.,
cheaters; and (2) task, including ill-definition, poor guidelines,
suboptimal setup, or inherent subjectivity. While in an ideal evaluation
experiment many of these elements will be controlled for, I argue that task
subjectivity is a much harder issue. In this talk I will cover a number of
evaluation experiments on tasks with varying degrees of subjectivity,
discuss their levels of disagreement along with other issues, and cover a
few practical approaches to address them. I hope this will lead to an open
discussion on possible strategies and directions to alleviate this problem.
10:00–11:00
Oral Session 1 (NLG)
10:00–10:20
It’s Commonsense, isn’t it? Demystifying Human Evaluations in
Commonsense-Enhanced NLG systems
Miruna-Adriana Clinciu, Dimitra Gkatzia and Saad Mahamood
10:20–10:40
Estimating Subjective Crowd-Evaluations as an Additional Objective to
Improve Natural Language Generation
Jakob Nyberg, Maike Paetzel and Ramesh Manuvinakurike
10:40–11:00
Trading Off Diversity and Quality in Natural Language Generation
Hugh Zhang, Daniel Duckworth, Daphne Ippolito and Arvind Neelakantan
11:00–11:30
Break
11:30–12:10
Oral Session 2 (MT)
11:30–11:50
Towards Document-Level Human MT Evaluation: On the Issues of Annotator
Agreement, Effort and Misevaluation
Sheila Castilho
11:50–12:10
Is This Translation Error Critical?: Classification-Based Human and
Automatic Machine Translation Evaluation Focusing on Critical Errors
Katsuhito Sudoh, Kosuke Takahashi and Satoshi Nakamura
12:10–13:30
Poster Session
- Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco, Damir Juric, Aleksandar Savkov and Ehud Reiter
- A Preliminary Study on Evaluating Consultation Notes With Post-Editing
Francesco Moramarco, Alex Papadopoulos Korfiatis, Aleksandar Savkov and
Ehud Reiter
- The Great Misalignment Problem in Human Evaluation of NLP Methods
Mika Hämäläinen and Khalid Alnajjar
- A View From the Crowd: Evaluation Challenges for Time-Offset Interaction
Applications
Alberto Chierici and Nizar Habash
- Reliability of Human Evaluation for Text Summarization: Lessons Learned
and Challenges Ahead
Neslihan Iskender, Tim Polzehl and Sebastian Möller
- On User Interfaces for Large-Scale Document-Level Human Evaluation of
Machine Translation Outputs
Roman Grundkiewicz, Marcin Junczys-Dowmunt, Christian Federmann and Tom
Kocmi
- Eliciting Explicit Knowledge From Domain Experts in Direct Intrinsic
Evaluation of Word Embeddings for Specialized Domains
Goya van Boven and Jelke Bloem
- Detecting Post-Edited References and Their Effect on Human Evaluation
Věra Kloudová, Ondřej Bojar and Martin Popel
13:30–15:00
Lunch
15:00–15:40
Oral Session 3
15:00–15:20
A Case Study of Efficacy and Challenges in Practical Human-in-Loop
Evaluation of NLP Systems Using Checklist
Shaily Bhatt, Rahul Jain, Sandipan Dandapat and Sunayana Sitaram
15:20–15:40
Interrater Disagreement Resolution: A Systematic Procedure to Reach
Consensus in Annotation Tasks
Yvette Oortwijn, Thijs Ossenkoppele and Arianna Betti
15:40–16:40
Open-Mic Discussion Panel
Chair: Ehud Reiter
The discussion session will be open to all participants. Anyone interested
in speaking for 3 minutes about any topic relevant to the workshop should
email Ehud Reiter (e.reiter@abdn.ac.uk). These short presentations will be
followed by a general discussion.
16:40–17:00
Break
17:00–17:50
Invited Talk: The Ins and Outs of Ethics-Informed Evaluation
by Margaret Mitchell http://www.m-mitchell.com/
The modern train/test paradigm in Artificial Intelligence (AI) and Machine
Learning (ML) narrows what we can understand about AI models, and skews our
understanding of models’ robustness in different environments. In this
talk, I will work through the different factors involved in ethics-informed
AI evaluation, including connections to ML training and ML fairness, and
present an overarching evaluation protocol that addresses a multitude of
considerations in developing ethical AI.
17:50–18:00
Closing