Call for Participation: DSTC11 - Track 4

Luis Fernando D'Haro
Mon, Nov 21, 2022 8:38 PM

Track 4: Robust and Multilingual Automatic Evaluation Metrics for
Open-Domain Dialogue Systems - Eleventh Dialog System Technology Challenge
(DSTC11.T4)

Call for Participation

*TRACK GOALS AND DETAILS:* Two main goals and tasks:
•    Task 1: Propose and develop effective automatic metrics for the evaluation of open-domain multilingual dialogues.
•    Task 2: Propose and develop robust metrics for dialogue systems trained with back-translated and paraphrased dialogues in English.

*EXPECTED PROPERTIES OF THE PROPOSED METRICS:*
•    High correlation with human-annotated assessments.
•    Explainable metrics in terms of the quality of the model-generated responses.
•    Participants can propose their own metric or optionally improve the baseline evaluation metric, deep AM-FM (Zhang et al., 2020).

*TASK 1: METRICS FOR MULTILINGUAL DATA*
In this task, participants are asked to propose a single metric model for the automatic evaluation of multilingual dialogues in English, Spanish and Chinese. The model should produce scores that correlate highly with human annotations.
Participants are expected to use pre-trained or fine-tuned multilingual models and train them to predict multidimensional quality metrics using self-supervised techniques.
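
To make the expected setup more concrete, here is a minimal sketch (not the official deep AM-FM baseline) of fine-tuning a generic multilingual encoder to regress a turn-level quality score. The encoder name, pooling strategy and (context, response) input format are assumptions made for illustration only:

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

class TurnQualityScorer(nn.Module):
    """Regresses one scalar quality score from a (context, response) pair."""
    def __init__(self, encoder_name="xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]   # first-token representation
        return self.head(pooled).squeeze(-1)   # one scalar score per pair

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = TurnQualityScorer()

# Score one (context, response) pair; after training on self-supervised or
# automatically annotated targets, higher scores should track human ratings.
batch = tokenizer("How are you?", "I'm fine, thanks!",
                  return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(batch["input_ids"], batch["attention_mask"])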

*TASK 2: ROBUST METRICS*
In this task, participants are asked to propose robust metrics for the automatic evaluation of English sentences that have been back-translated or automatically paraphrased. Here, robustness means that the metric behaves consistently on sentences that keep the semantic meaning of the original but differ in wording.
The proposed metric model will be evaluated by comparing the scores produced on the original sentences with the scores produced on the back-translated/paraphrased sentences. The correlations with human annotations obtained on the paraphrased sentences must therefore be on par with those obtained on the original sentences.
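
As a sketch of how such a robustness check could look (the organizers' exact evaluation script may differ), the snippet below scores the same responses in their original and paraphrased forms and compares both against human ratings; score_fn and the toy data are placeholders, and Spearman is just one possible correlation:

from scipy.stats import spearmanr

def robustness_report(score_fn, originals, paraphrases, human_scores):
    # score_fn: any candidate metric mapping a response string to a float
    orig_scores = [score_fn(s) for s in originals]
    para_scores = [score_fn(s) for s in paraphrases]
    return {
        "orig_vs_human": spearmanr(orig_scores, human_scores).correlation,  # original wording
        "para_vs_human": spearmanr(para_scores, human_scores).correlation,  # paraphrased wording
        "orig_vs_para": spearmanr(orig_scores, para_scores).correlation,    # metric self-consistency
    }

# Toy usage with a deliberately naive placeholder metric (response length).
report = robustness_report(
    score_fn=len,
    originals=["I really enjoyed the movie.", "Sure, see you at five."],
    paraphrases=["The movie was something I really liked.", "Of course, meet you at five."],
    human_scores=[4.5, 3.0],
)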

*DATASETS:*
For training: up to 18 curated human-human multilingual datasets (over 3M turns), with turn- and dialogue-level automatic annotations, including QE metrics and toxicity.
Dev/Test: up to 10 curated human-chatbot multilingual datasets (over 150k turns), with turn- and dialogue-level human annotations.

*REGISTRATION AND FURTHER INFORMATION:*
ChatEval: https://chateval.org/dstc11
GitHub: https://github.com/Mario-RC/dstc11_track4_robust_multilingual_metrics

*PROPOSED SCHEDULE:*
Training/Validation data release: November to December 2022
Test data release: Mid-March 2023
Entry submission deadline: Mid-March 2023
Submission of final results: End of March 2023
Final result announcement: Early April 2023
Paper submission: March to May 2023
Workshop: July-September 2023, at a venue to be announced with DSTC11

*ORGANIZATIONS:*
Universidad Politécnica de Madrid (Spain)
National University of Singapore (Singapore)
Tencent AI Lab (China)
New York University (USA)
Carnegie Mellon University (USA)

Mario Rodríguez Cantelar
Postgraduate Non-Doctoral Researcher / PhD student
Centre for Automation and Robotics (UPM-CSIC)

--
Luis Fernando D'Haro
Profesor Contratado Doctor / Associate Professor
Grupo de Tecnologías del Habla / Speech Technology Group
Dpto. de Ingeniería Electrónica / Dept. of Electronics Engineering
Escuela Técnica Superior de Ingeniería de Telecomunicación
Universidad Politécnica de Madrid
Avenida Complutense nº 30, Ciudad Universitaria, 28040 - Madrid (España).
Despacho/Room: B-108
Teléfono/Phone: (+34) 910672174
Homepage: http://gth.die.upm.es/~lfdharo
