ReproGen: A Shared Task on Reproducibility of Human Evaluations in NLG: Final Call for Participation

AS
Agarwal, Shubham
Wed, Jun 23, 2021 3:32 PM

Results meeting at INLG’21 in Aberdeen, 20-24 September 2021
https://reprogen.github.io




Background

Across Natural Language Processing (NLP), a growing body of work is exploring the issue of reproducibility in machine learning contexts. However, the reproducibility of the results of human evaluation experiments is currently under-addressed. This is of concern because human evaluations provide the benchmarks against which automatic evaluation methods are assessed across NLP, and are moreover widely regarded as the standard form of evaluation in NLG.

With the ReproGen shared task on the reproducibility of human evaluations in NLG we aim (i) to shed light on the extent to which past NLG evaluations have been reproducible, and (ii) to draw conclusions regarding how human evaluations can be designed and reported to increase reproducibility. If the task is run over several years, we hope to be able to document an overall increase in levels of reproducibility over time.


About ReproGen

Following discussion of the ReproGen proposal at the INLG’20 GenChal session, we are organizing ReproGen with two tracks: a shared task in which teams repeat existing human evaluation studies with the aim of reproducing their results (Track A below), and an ‘unshared task’ in which teams attempt to reproduce their own prior human evaluation results (Track B):

A. Main Reproducibility Track: For a shared set of selected human evaluation studies, participants repeat one or more studies, and attempt to reproduce their results, using published information plus additional information and resources provided by the authors, and making common sense assumptions where information is still incomplete.

B. RYO Track: Reproduce Your Own previous human evaluation results, and report what happened. Unshared task.


Track A Papers

Following a call for proposals, we have selected the papers listed below for inclusion in ReproGen Track A. The authors have agreed to let the human evaluation studies identified below be used for reproduction studies, and have committed to supporting participants by providing the system outputs to be evaluated and any reusable tools used in the evaluations, and by being available for questions during the shared task period. Moreover, all authors have completed the ReproGen Human Evaluation Sheet, which we will use as the standard for establishing similarity between different human evaluation studies.

The papers and studies, with many thanks to the authors for supporting ReproGen, are:

  • Van der Lee et al. (2017): PASS: A Dutch data-to-text system for soccer, targeted towards specific audiences: (https://www.aclweb.org/anthology/W17-3513.pdf)
    [1 evaluation study; Dutch; 20 evaluators; 1 quality criterion; reproduction target: primary scores]

  • Dušek et al. (2018): Findings of the E2E NLG Challenge: (https://www.aclweb.org/anthology/W18-6539.pdf)
    [1 evaluation study; English; MTurk; 2 quality criteria; reproduction target: primary scores]

  • Qader et al. (2018): Generation of Company descriptions using concept-to-text and text-to-text deep models: dataset collection and systems evaluation: (https://www.aclweb.org/anthology/W18-6532.pdf)
    [only evaluation study in the paper; English; 19 evaluators; 4 quality criteria; reproduction target: primary scores]

  • Shaikh & Santhanam (2019): Towards Best Experiment Design for Evaluating Dialogue System Output: (https://www.aclweb.org/anthology/W19-8610.pdf)
    [3 evaluation studies differing in experimental design; English; x evaluators; 2 quality criteria; reproduction target: correlation scores between 3 studies]
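To illustrate what a "reproduction target" comparison might involve, reproduced primary scores can be set against the original ones, for example via Pearson correlation over per-system scores. The numbers and the choice of measure below are hypothetical, not prescribed by the task:

```python
from statistics import mean, stdev

def pearson(xs, ys):
    # Sample Pearson correlation between paired per-system score lists
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Hypothetical per-system scores: original study vs. reproduction
original = [4.2, 3.8, 3.1, 2.9]
reproduced = [4.0, 3.9, 3.0, 2.7]
print(round(pearson(original, reproduced), 3))  # -> 0.977
```

A correlation near 1 would indicate that the reproduction preserved the relative ranking of systems, even if absolute scores shifted; ReproGen's own reporting guidelines (in the participant's pack) define the measures actually used.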


Track A and B instructions

Step 1. Fill in the registration form (https://forms.gle/pBPRjPGwoKCY3hsf7), indicating which of the above papers, or which of your own papers, you wish to carry out a reproduction study for.

Step 2. You will receive the ReproGen participant’s pack, along with data, tools, and other materials for each of the studies you selected in the registration form.

Step 3. Carry out the reproduction, and submit a report of up to 8 pages (plus references and supplementary material, including a completed ReproGen Human Evaluation Sheet for each reproduction study) by 15 August 2021.

Step 4. The organizers will carry out a light-touch review of the evaluation reports according to the following criteria:

  • Evaluation sheet has been completed.
  • Exact repetition of the study has been attempted and is described in the report.
  • Report gives full details of the reproduction study, in accordance with the reporting guidelines provided.
  • All tools and resources used in the study are publicly available.

Step 5. Present the paper at the results meeting.

Full details and instructions will be provided in the ReproGen participant’s pack.


Important Dates

28 Jan 2021: Announcement and Call for Human Evaluations to be Reproduced
15 Feb 2021: Submission deadline for proposals of human evaluations
27 Feb 2021: First Call for Participation and registration opens
10 April 2021: Second Call for Participation
15 Aug 2021: Submission deadline for reproduction reports
20-24 Sep 2021: Results presented at INLG (conference dates to be confirmed)


Organizers

Anya Belz, University of Brighton, UK
Shubham Agarwal, Heriot-Watt University, UK
Anastasia Shimorina, Université de Lorraine / LORIA, France
Ehud Reiter, University of Aberdeen, UK


Contact

reprogen.task@gmail.com
https://reprogen.github.io
https://shubhamagarwal92.github.io/


