Uncovering Machine Translationese Using Corpus Analysis Techniques to Distinguish between Original and Machine-Translated French

2021 
This paper investigates the linguistic characteristics of English-to-French machine-translated texts in comparison with original, untranslated French texts in order to uncover what has been called “machine translationese”. In the same vein as corpus-based translation studies, which have focused on human-translated texts, and using a corpus-based statistical approach (Principal Component Analysis), we analyzed a ca. 1.8-million-word corpus of English-to-French translations of press texts, corresponding to the output of four machine translation systems: one statistical (SMT) and three neural (NMT) systems, namely DeepL, Google Translate, and the European Commission’s eTranslation MT tool, in both its SMT and NMT versions. In particular, to complement a previous study on language-specific features in French (e.g. derived adverbs, existential constructions, coordinator et, preposition avec), a series of language-independent linguistic features was extracted for each text in our corpus, ranging from superficial text characteristics such as average word and sentence length to frequencies of closed-class lexical categories and measures of lexical diversity. Our results, which compare the machine-translated data with a corpus of untranslated French data, allow us to uncover linguistic features in French machine-translated texts that clearly deviate from the norms observed in original French (e.g. average sentence length, n-gram features, lexical diversity), and which might inform the post-editing process in order to optimize translation quality.
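As a rough illustration of the per-text, language-independent features mentioned above, the following minimal Python sketch computes average word length, average sentence length, and a type-token ratio for a single text. The function name `surface_features`, the regex-based tokenization and sentence splitting, and the choice of type-token ratio as the lexical diversity measure are all assumptions made for this example; the abstract does not specify the tools or measures actually used. In a setup like the one described, one such feature vector per text would then be fed into Principal Component Analysis, which is not shown here.

```python
import re


def surface_features(text):
    """Return a few surface features for one text (illustrative only)."""
    # Assumption: sentences end with ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Assumption: a word is a run of Unicode letters (accents included);
    # apostrophes and hyphens act as separators.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words or not sentences:
        raise ValueError("text contains no words")
    return {
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "avg_sentence_length": len(words) / len(sentences),
        # Type-token ratio: distinct word forms divided by total words.
        "type_token_ratio": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    print(surface_features("Le chat dort. Le chien mange."))
```

Note that the type-token ratio is sensitive to text length, so comparisons across texts of very different sizes would need a length-normalized measure instead.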