INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Call for in-person and online participation

Craig Thomson
Mon, Sep 16, 2024 2:26 PM

INLG 2024 Tutorial on Human Evaluation of NLP System Quality

24th September 2024 at INLG 2024, Tokyo

*Available to both in-person and remote attendees. Registration for INLG is
open now:* https://amarys-jtb.jp/INLG2024/

*We will release all slides and colab notebooks from the tutorial:*
https://human-evaluation-tutorial.github.io/

Anya Belz†
João Sedoc⚘
Craig Thomson†
Simon Mille†
Rudali Huidrom†
†ADAPT Research Centre, Dublin City University, Ireland
⚘New York University, USA

Description:

Human evaluation has always been considered the most reliable form of
evaluation in Natural Language Processing (NLP), but recent research has
thrown up a number of concerning issues, including in the design (Belz et
al., 2020; Howcroft et al., 2020) and execution (Thomson et al., 2024) of
human evaluation experiments. Standardisation and comparability across
different experiments are low, as is reproducibility, in the sense that
repeat runs of the same evaluation often do not support the same main
conclusions, quite apart from not producing similar scores.

The current situation is likely to be in part due to how human evaluation
is viewed in NLP: not as something that needs to be studied and learnt
before venturing into conducting an evaluation experiment, but as something
that anyone can throw together without prior knowledge by pulling in a
couple of students from the lab next door.

Our aim with this tutorial is primarily to inform participants about the
range of options available and choices that need to be made when creating
human evaluation experiments, and what the implications of different
decisions are. Moreover, we will present best practice principles and
practical tools that help researchers design scientifically rigorous,
informative and reliable experiments.

As the schedule below indicates, we are planning for a morning of
presentations and brief exercises, followed by a practical session in the
afternoon where participants will be supported in creating evaluation
experiments and analysing results from them, using tools and other
resources provided by the tutorial team.
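
To give a flavour of the kind of analysis supported in the afternoon
session, here is a minimal, purely illustrative sketch (not taken from the
tutorial's colab notebooks) of comparing human ratings of two hypothetical
systems with a paired significance test; the rating data and the choice of
test are assumptions for demonstration only.

    # Illustrative sketch only: compare human ratings of two hypothetical
    # systems on the same test inputs. Ratings are invented for demonstration.
    from scipy import stats

    # Mean per-output ratings (e.g. on a 1-7 scale) for the same 10 inputs.
    system_a = [5.2, 4.8, 6.1, 5.5, 4.9, 5.7, 6.0, 5.1, 4.6, 5.9]
    system_b = [4.7, 4.5, 5.8, 5.0, 4.4, 5.2, 5.6, 4.9, 4.1, 5.3]

    # Paired, non-parametric test: both systems were rated on the same inputs,
    # and ordinal ratings need not be normally distributed.
    statistic, p_value = stats.wilcoxon(system_a, system_b)
    print(f"Wilcoxon signed-rank: statistic={statistic:.2f}, p={p_value:.4f}")

Which test is appropriate depends on the experiment design (paired vs.
unpaired comparisons, ordinal vs. continuous responses); choices of this
kind are exactly what the experiment design and statistical analysis units
cover.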

We aim to address all aspects of human evaluation of system outputs in a
research setting, equipping participants with the knowledge, tools,
resources and hands-on experience needed to design and execute rigorous and
reliable human evaluation experiments. Take-home materials and online
resources will continue to support participants in conducting experiments
after the tutorial.

Schedule:

Time Unit #: Topic
09:30—10:00 Unit 1: Introduction
10:00—10:30 Unit 2: Development and Components of Human Evaluations
10:30—10:45 Break
10:45—11:45 Unit 3: Quality Criteria and Evaluation Modes
11:45—12:30 Unit 4: Experiment Design
12:30—14:00 Lunch
14:00—15:15 Unit 5: Statistical Analysis of Results
15:15—15:30 Break
15:30—16:15 Unit 6: Experiment Implementation
16:15—16:40 Unit 7: Experiment Execution
16:40—16:55 Break
16:55—18:30 Unit 8: Practical Session

Summary paper:

Anya Belz, João Sedoc, Craig Thomson, Simon Mille and Rudali Huidrom. 2024.
The INLG 2024 Tutorial on Human Evaluation of NLP System Quality: Background,
Overall Aims, and Summaries of Taught Units. In Proceedings of the 17th
International Conference on Natural Language Generation, Tokyo, Japan.
https://aclanthology.org/2024.inlg-tutorials.1

