Evaluation of machine translation

Various methods for the evaluation of machine translation have been employed. This article focuses on the evaluation of the output of machine translation, rather than on performance or usability evaluation.

A typical way for lay people to assess machine translation quality is to translate from a source language to a target language and back to the source language with the same engine. Though intuitively this may seem like a good method of evaluation, it has been shown that round-trip translation is a 'poor predictor of quality'. The reason it is such a poor predictor is reasonably intuitive: a round-trip translation does not test one system but two, the language pair of the engine translating into the target language and the language pair translating back from the target language. Consider the following examples of round-trip translation performed from English to Italian and Portuguese, from Somers (2005). In the first example, where the text is translated into Italian and then back into English, the English text is significantly garbled, but the Italian is a serviceable translation. In the second example, the text translated back into English is perfect, but the Portuguese translation is meaningless. While round-trip translation may be useful to generate a 'surplus of fun', the methodology is deficient for serious study of machine translation quality (a minimal sketch of the procedure appears at the end of this article).

This section covers two of the large-scale evaluation studies that have had significant impact on the field: the ALPAC 1966 study and the ARPA study. One of the constituent parts of the ALPAC report was a study comparing different levels of human translation with machine translation output, using human subjects as judges. The human judges were specially trained for the purpose. The evaluation study compared an MT system translating from Russian into English with human translators, on two variables: 'intelligibility' and 'fidelity'. Intelligibility was a measure of how 'understandable' the sentence was, and was measured on a scale of 1–9. Fidelity was a measure of how much information the translated sentence retained compared to the original, and was measured on a scale of 0–9. Each point on the scale was associated with a textual description. For example, 3 on the intelligibility scale was described as 'Generally unintelligible; it tends to read like nonsense but, with a considerable amount of reflection and study, one can at least hypothesize the idea intended by the sentence'.
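The round-trip procedure discussed above can be made concrete with a short sketch. The snippet below is a minimal illustration in Python, assuming a generic translate(text, source, target) call; the stub engine and the sample sentence are placeholders of my own, not material from Somers (2005) or the studies described here.

```python
# Minimal sketch of a round-trip evaluation harness, assuming a generic
# translate() call. The stub below is a placeholder, not a real MT engine.

def translate(text: str, source: str, target: str) -> str:
    """Placeholder for a call to whatever MT engine is under test."""
    return text  # a real implementation would return an actual translation


def round_trip(text: str, source: str, pivot: str) -> str:
    """Translate into the pivot language and back into the source language."""
    forward = translate(text, source, pivot)      # exercises the source->pivot system
    backward = translate(forward, pivot, source)  # exercises the pivot->source system
    return backward


if __name__ == "__main__":
    original = "This is only a sample sentence."          # illustrative input
    returned = round_trip(original, source="en", pivot="it")
    # Comparing `original` and `returned` conflates the quality of two
    # independent systems, which is why round-trip agreement is a poor
    # predictor of translation quality.
    print("Original:", original)
    print("Returned:", returned)
```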

[ "Machine translation software usability" ]
Parent Topic
Child Topic
    No Parent Topic