Randomised Language Modelling for Statistical Machine Translation

David Talbot,Miles Osborne

Randomised Language Modelling for Statistical Machine Translation

2007

A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation. We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate ngrams. Our solutions in both cases retain the one-sided error guarantees of the BF while takingadvantageof theZipf-likedistribution of word frequencies to reduce the space requirements.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations