Randomized Significance Tests in Machine Translation

Yvette Graham, Nitika Mathur, Timothy Baldwin

2014

Randomized methods of significance testing enable estimation of the probability that an increase in score has occurred simply by chance. In this paper, we examine the accuracy of three randomized methods of significance testing in the context of machine translation: paired bootstrap resampling, bootstrap resampling, and approximate randomization. We carry out a large-scale human evaluation of shared task systems for two language pairs to provide a gold standard for the tests. Results show very little difference in accuracy across the three methods of significance testing. Notably, the accuracy of all test/metric combinations for English-to-Spanish evaluation is so low that there is not enough evidence to conclude they are any better than a random coin toss.
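
To make the idea of a randomized test concrete, the following is a minimal sketch of paired approximate randomization, one of the three methods named above. It is not the authors' implementation. The function name `approximate_randomization`, its parameters, and the use of a mean over per-segment scores as the test statistic are all assumptions made for illustration; corpus-level metrics such as BLEU would need the statistic recomputed from the shuffled outputs rather than averaged.

```python
import random


def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Estimate a two-sided p-value for the difference in mean segment score.

    scores_a, scores_b: per-segment scores of two systems on the same test
    segments (hypothetical inputs; any segment-level metric would do).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")

    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n

    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            # Under the null hypothesis the system labels are exchangeable,
            # so swap the pair for this segment with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least_as_extreme += 1

    # Add-one smoothing keeps the estimate from being exactly zero.
    return (at_least_as_extreme + 1) / (trials + 1)
```

A small returned value suggests the observed score difference is unlikely to arise by chance alone. Paired bootstrap resampling differs in that it resamples segments with replacement instead of swapping system labels.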

Keywords:

Machine translation
Statistics
Bootstrapping (statistics)
Metrical task system
Randomization
Data mining
Computer science
Randomized methods
Gold standard
Coin flipping
Significance testing
