Results of the WMT20 Metrics Shared Task

Nitika Mathur, Johnny Wei, Markus Freitag, Qingsong Ma, Ondřej Bojar

2020

This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics. Ten research groups submitted 27 metrics, four of which are reference-less “metrics”. In addition, we computed five baseline metrics, including sentBLEU, BLEU, and TER, using the SacreBLEU scorer. All metrics were evaluated on how well they correlate with the official WMT20 human scores at the system, document, and segment levels. We present an extensive analysis of the influence of different reference translations on metric reliability and of how well automatic metrics score human translations, and we flag major discrepancies between metric and human scores when evaluating MT systems. Finally, we investigate whether automatic metrics can be used to flag incorrect human ratings.
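As a rough illustration of what correlating a metric with human scores at the system level can look like, the sketch below computes a correlation coefficient over one score per MT system. The abstract does not name the correlation statistic, so Pearson's r is an assumption made here for illustration only; the function name `pearson` and all numbers are hypothetical toy values, not WMT20 results.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: one score per MT system (NOT real WMT20 results).
metric_scores = [30.1, 28.4, 25.0, 33.2]   # e.g. an automatic metric
human_scores = [0.12, 0.05, -0.20, 0.31]   # e.g. averaged human judgements

r = pearson(metric_scores, human_scores)
print(f"system-level Pearson r = {r:.3f}")
```

A value of r close to 1 would indicate that the metric ranks and spaces the systems much as the human scores do. Document- and segment-level evaluation would need per-document or per-segment scores and possibly a different statistic.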
