Departamento de Nosotros: How Machine Translated Corpora Affects Language Models in MRC Tasks.

2020 
Pre-training large-scale language models (LMs) requires huge amounts of text. LMs for English enjoy ever-growing corpora of diverse language resources. However, less-resourced languages and their mono- and multilingual LMs often struggle to obtain bigger datasets. A typical approach in this case is to machine-translate English corpora into the target language. In this work, we study the caveats of applying directly translated corpora for fine-tuning LMs for downstream natural language processing tasks and demonstrate that careful curation along with post-processing leads to improved performance and overall LM robustness. In the empirical evaluation, we compare directly translated and curated Spanish SQuAD datasets on both user and system levels. Further experimental results on the XQuAD and MLQA downstream transfer-learning question answering tasks show that presumably multilingual LMs exhibit more resilience to machine translation artifacts in terms of the exact match score.