Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French
2021
This paper investigates the linguistic characteristics of English-to-French machine-translated texts in comparison with original, untranslated French texts in order to uncover what has been called “machine translationese”. In the same vein as corpus-based translation studies, which have focused on human-translated texts, and using a corpus-based statistical approach (Principal Component Analysis), we analyzed a ca. 1.8-million-word corpus of English-to-French translations of press texts, corresponding to the output of four machine translation systems: one statistical (SMT) and three neural (NMT) systems, namely DeepL, Google Translate, and the European Commission’s eTranslation MT tool, in both its SMT and NMT versions. In particular, to complement a previous study on language-specific features in French (e.g. derived adverbs, existential constructions, coordinator et, preposition avec), a series of language-independent linguistic features were extracted for each text in our corpus, ranging from superficial text characteristics such as average word and sentence length to frequencies of closed-class lexical categories and measures of lexical diversity. Our results, which compare the machine-translated data with a corpus of untranslated French data, allow us to uncover linguistic features in French machine-translated texts that clearly deviate from the observed norms in original French (e.g. average sentence length, n-gram features, lexical diversity), and which might inform the post-editing process in order to optimize translation quality.
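As an illustration of the kind of language-independent surface features mentioned above, the following minimal sketch computes average word length, average sentence length, and a type-token ratio for a single text. It is not the authors' pipeline: the function name `surface_features`, the regex-based tokenization, the naive sentence splitting, and the choice of type-token ratio as the lexical diversity measure are all assumptions made for the example.

```python
import re


def surface_features(text):
    """Compute three simple surface features for one text (illustrative only)."""
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a token is a run of letters (accented letters included), lowercased.
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens or not sentences:
        raise ValueError("text contains no tokens")
    return {
        # Mean number of characters per token.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        # Mean number of tokens per sentence.
        "avg_sentence_length": len(tokens) / len(sentences),
        # Distinct tokens divided by total tokens (a basic lexical diversity measure).
        "type_token_ratio": len(set(tokens)) / len(tokens),
    }
```

Calling `surface_features` on each text of a corpus would yield one feature vector per text, which could then be passed to a dimensionality reduction method such as Principal Component Analysis. Note that the type-token ratio is sensitive to text length, so comparing texts of different sizes would call for a length-normalized measure.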