Randomised Language Modelling for Statistical Machine Translation

2007 
A Bloom filter (BF) is a randomised data structure for set membership queries. Its space requirements are significantly below lossless information-theoretic lower bounds but it produces false positives with some quantifiable probability. Here we explore the use of BFs for language modelling in statistical machine translation. We show how a BF containing n-grams can enable us to use much larger corpora and higher-order models complementing a conventional n-gram LM within an SMT system. We also consider (i) how to include approximate frequency information efficiently within a BF and (ii) how to reduce the error rate of these models by first checking for lower-order sub-sequences in candidate ngrams. Our solutions in both cases retain the one-sided error guarantees of the BF while takingadvantageof theZipf-likedistribution of word frequencies to reduce the space requirements.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    9
    References
    81
    Citations
    NaN
    KQI
    []