Track 4: Robust and Multilingual Automatic Evaluation Metrics for
Open-Domain Dialogue Systems - Eleventh Dialog System Technology Challenge
(DSTC11.T4)
Call for Participation
**************** Baseline Model is now available! ****************
TRACK GOALS AND DETAILS: The track comprises two main tasks:
• Task 1: Propose and develop effective Automatic Metrics for the evaluation
of open-domain, multilingual dialogues.
• Task 2: Propose and develop Robust Metrics for dialogue systems trained
with back-translated and paraphrased dialogues in English.
EXPECTED PROPERTIES OF THE PROPOSED METRICS:
• High correlation with human-annotated assessments (see the sketch after
this list).
• Explainability of the metric scores in terms of the quality of the
model-generated responses.
• Participants may propose their own metric or, optionally, improve the
baseline evaluation metric, Deep AM-FM (Zhang et al., 2020).
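For reference, submitted metrics are typically judged by how well their scores
correlate with the human annotations. The snippet below is a minimal sketch of
such a check using made-up placeholder scores; it is not the official scoring
script of the track.

```python
# Minimal sketch (not part of the track tooling): comparing a candidate
# metric's scores against human quality annotations via correlation.
from scipy.stats import pearsonr, spearmanr

# Hypothetical turn-level scores produced by a candidate automatic metric.
metric_scores = [0.72, 0.31, 0.88, 0.55, 0.14]
# Hypothetical human quality ratings for the same turns (e.g., a 1-5 scale).
human_ratings = [4.0, 2.5, 4.5, 3.0, 1.5]

pearson_r, _ = pearsonr(metric_scores, human_ratings)
spearman_rho, _ = spearmanr(metric_scores, human_ratings)
print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```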
DATASETS:
For training: Up to 18 curated human-human multilingual datasets (over 3M
turns), with turn- and dialogue-level automatic annotations such as toxicity
or sentiment analysis, among others.
Dev/Test: Up to 10 curated human-chatbot multilingual datasets (over 150k
turns), with turn- and dialogue-level human annotations, including QE metrics
and cosine similarity.
The data are translated and back-translated into several languages (English,
Spanish and Chinese). In addition, several annotated paraphrases are provided
for each dataset.
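As an illustration of how a back-translated variant of a dialogue turn can be
produced (this sketches the general technique under the assumption of
off-the-shelf MarianMT models; it is not the organizers' data-preparation
pipeline):

```python
# Minimal sketch: English -> Spanish translation and Spanish -> English
# back-translation of a single dialogue turn, using publicly available
# Helsinki-NLP MarianMT models through the Hugging Face transformers pipeline.
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

turn = "I really enjoyed the movie we watched last night."
spanish = en_to_es(turn)[0]["translation_text"]
back_translated = es_to_en(spanish)[0]["translation_text"]

print("Original:       ", turn)
print("Spanish:        ", spanish)
print("Back-translated:", back_translated)
```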
BASELINE MODEL:
The default choice is Deep AM-FM (Zhang et al., 2020). This model has been
adapted to evaluate multilingual datasets and to work with paraphrased and
back-translated sentences.
GitHub: https://github.com/karthik19967829/DSTC11-Benchmark
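Deep AM-FM combines an adequacy component (embedding-based similarity with a
reference response) and a fluency component (language-model based). The sketch
below illustrates only the general idea of an embedding-similarity adequacy
score; it is a simplification, not the baseline implementation in the
repository above, and the multilingual model name is an assumption.

```python
# Minimal sketch of an adequacy-style score: cosine similarity between a
# generated response and a reference response in a multilingual sentence
# embedding space. Illustrative only; not the actual Deep AM-FM code.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

reference = "Sure, I can book a table for two at 8 pm."
generated = "Of course, a table for two people at eight is reserved."

emb_ref, emb_gen = model.encode([reference, generated], convert_to_tensor=True)
adequacy = util.cos_sim(emb_ref, emb_gen).item()  # value in [-1, 1]
print(f"Adequacy-style score: {adequacy:.3f}")
```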
REGISTRATION AND FURTHER INFORMATION:
ChatEval: https://chateval.org/dstc11
GitHub: https://github.com/Mario-RC/dstc11_track4_robust_multilingual_metrics
PROPOSED SCHEDULE:
Training/Validation data release: November to December 2022
Test data release: Mid-March 2023
Entry submission deadline: Mid-March 2023
Submission of final results: End of March 2023
Final result announcement: Early April 2023
Paper submission: March to May 2023
Workshop: July to September 2023, at a venue to be announced with DSTC11
ORGANIZATIONS:
Universidad Politécnica de Madrid (Spain)
National University of Singapore (Singapore)
Tencent AI Lab (China)
New York University (USA)
Carnegie Mellon University (USA)
Mario Rodríguez Cantelar
Postgraduate Non-Doctoral Researcher / PhD student
Centre for Automation and Robotics (UPM-CSIC)
--
Luis Fernando D'Haro
Profesor Contratado Doctor / Associate Professor
Grupo de Tecnología del Habla y Aprendizaje Automático / Speech Technology
and Machine Learning Group
Dpto. de Ingeniería Electrónica / Dept. of Electronics Engineering
Escuela Técnica Superior de Ingeniería de Telecomunicación
Universidad Politécnica de Madrid
Avenida Complutense nº 30, Ciudad Universitaria, 28040 - Madrid (Spain).
Despacho/Room: B-108
Teléfono/Phone: (+34) 910672174
Homepage: http://gth.die.upm.es/~lfdharo