More Data Is Better Only to Some Level, After Which It Is Harmful: Profiling Neural Machine Translation Self-learning with Back-Translation.

2021 
Neural machine translation needs a very large volume of data to unfold its potential. Self-learning with back-translation has become widely adopted to address this data scarcity bottleneck: a seed system translates source-language monolingual sentences, which are then aligned with its output sentences to form a synthetic data set that, when used to retrain the system, improves its translation performance. In this paper we report on the profiling of self-learning with back-translation, aiming to clarify whether adding more synthetic data always leads to an increase in performance. The experiments undertaken gathered evidence indicating that more synthetic data is better only up to some level, after which it becomes harmful as translation quality decays.
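
The self-learning loop described above can be summarized in a minimal sketch. All names here (Translator, train, translate, self_learning_round) are hypothetical placeholders for illustration, not the authors' actual pipeline; the synthetic_budget parameter merely reflects the paper's finding that synthetic data helps only up to some level.

    # Minimal sketch of self-learning with back-translation, assuming a
    # hypothetical Translator stand-in (not the authors' implementation).
    from typing import List, Tuple

    ParallelCorpus = List[Tuple[str, str]]  # (source sentence, target sentence)

    class Translator:
        """Stand-in for an NMT system; a real one would wrap e.g. a Transformer."""

        def __init__(self) -> None:
            self.corpus: ParallelCorpus = []

        def train(self, corpus: ParallelCorpus) -> None:
            # A real implementation would optimize model parameters here.
            self.corpus = list(corpus)

        def translate(self, sentence: str) -> str:
            # Placeholder output; a real system would decode with the model.
            return f"<translation of: {sentence}>"

    def self_learning_round(system: Translator,
                            authentic: ParallelCorpus,
                            monolingual: List[str],
                            synthetic_budget: int) -> Translator:
        """One round of self-learning: the seed system translates monolingual
        source sentences, inputs are aligned with outputs to form synthetic
        pairs, and the system is retrained on authentic + synthetic data.
        synthetic_budget caps how much synthetic data is added."""
        synthetic = [(src, system.translate(src))
                     for src in monolingual[:synthetic_budget]]
        retrained = Translator()
        retrained.train(authentic + synthetic)
        return retrained

    if __name__ == "__main__":
        seed = Translator()
        authentic = [("ola mundo", "hello world")]
        seed.train(authentic)
        improved = self_learning_round(
            seed,
            authentic=authentic,
            monolingual=["bom dia", "boa noite"],
            synthetic_budget=1,
        )
        print(len(improved.corpus))  # 1 authentic pair + 1 synthetic pair -> 2

In practice one would sweep synthetic_budget upward and measure translation quality at each level, which is the profiling the paper reports on.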