Training Data Augmentation for Code-Mixed Translation

Abhirut Gupta, Aditya Vavre, Sunita Sarawagi

2021

Machine translation of user-generated code-mixed inputs to English is crucial for applications such as web search and targeted advertising. We address the scarcity of parallel data for training such models with a strategy that converts existing non-code-mixed parallel data into code-mixed parallel data. We present an m-BERT-based procedure whose core learnable component is a ternary sequence labeling model that can be trained on a limited code-mixed corpus alone. On a Hindi-English code-mixed translation task, training a translation model with our data augmentation strategy yields a 5.8-point increase in BLEU on heavily code-mixed sentences.
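
As a rough illustration of the conversion step, the sketch below applies per-token labels to one parallel sentence pair to produce a code-mixed source sentence. It is not the authors' implementation. The abstract says only that the labeler is ternary, so the three label names (KEEP, SWITCH, DROP), the function code_mix, the word-alignment dictionary, and the romanized example sentence are all assumptions made for illustration. In practice the labels would come from the trained sequence labeling model rather than being written by hand.

```python
from typing import Dict, List

# Hypothetical label set: the abstract does not define the three labels.
KEEP, SWITCH, DROP = 0, 1, 2


def code_mix(
    src_tokens: List[str],
    tgt_tokens: List[str],
    alignment: Dict[int, int],
    labels: List[int],
) -> List[str]:
    """Build a code-mixed source sentence from a parallel pair.

    alignment maps a source token index to a target token index
    (assumed to come from some external word aligner).
    """
    if len(labels) != len(src_tokens):
        raise ValueError("need exactly one label per source token")

    mixed: List[str] = []
    for i, (token, label) in enumerate(zip(src_tokens, labels)):
        if label == KEEP:
            mixed.append(token)
        elif label == SWITCH:
            # Use the aligned English word; fall back to the source token
            # when the aligner produced no link for this position.
            j = alignment.get(i)
            mixed.append(tgt_tokens[j] if j is not None else token)
        elif label == DROP:
            continue
        else:
            raise ValueError(f"unknown label: {label}")
    return mixed


# Made-up example pair (romanized Hindi and English), not from the paper.
hindi = ["mujhe", "yeh", "kitab", "pasand", "hai"]
english = ["I", "like", "this", "book"]
links = {0: 0, 1: 2, 2: 3, 3: 1}  # "hai" has no aligned English word
tags = [KEEP, KEEP, SWITCH, KEEP, KEEP]

print(" ".join(code_mix(hindi, english, links, tags)))
```

The resulting code-mixed sentence would be paired with the original English target to form a new training example, so the translation model sees code-mixed input without any new human translation.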

Keywords:

Machine translation
Computer science
Translation (geometry)
Code (cryptography)
Task
Component (UML)
Point (typography)
Natural language processing
Targeted advertising
Core (game theory)
Artificial intelligence
