Learning to Extract Katakana-English Word Pairs from Non-Aligned Web Queries Using a Noisy-Channel Model of Back-Transliteration

Eric D Brill, Gary Kacmarcik, Chris Brockett

2001

This paper describes a method of extracting katakana words and phrases, along with their English counterparts, from non-aligned monolingual web search engine query logs. The method employs a trainable edit distance function to find pairs that have a high probability of being equivalent. These pairs can then be used to bootstrap further training of the edit distance function, resulting in improved back-transliteration from katakana to English. The method is also an effective way to mine large numbers of katakana strings to enhance a bilingual lexicon. The improved edit distance function and enhanced lexicon can be used for more accurate alignment of bitexts, and at runtime in machine translation (MT) and multilingual information retrieval (IR).
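
As a rough illustration of how an edit distance with adjustable costs can rank English candidates for a romanized katakana string, the following minimal sketch may help. It is not the paper's model: the cost tables (`SUB_COST`, `DEL_COST`), their values, and the function names are hypothetical and hand-set here, whereas the method described above learns its parameters from extracted pairs.

```python
from typing import Dict, List, Tuple

# Hypothetical, hand-set costs. A trained model would estimate these
# from katakana-English pairs instead of fixing them by hand.
SUB_COST: Dict[Tuple[str, str], float] = {
    ("r", "l"): 0.2,  # katakana does not distinguish r and l
    ("b", "v"): 0.2,  # v is commonly rendered as b
}
DEL_COST: Dict[str, float] = {
    "u": 0.3,  # vowels inserted after consonants are cheap to drop
    "o": 0.3,
}
DEFAULT_COST = 1.0  # cost of any operation not listed above


def deletion_cost(ch: str) -> float:
    """Cost of deleting one character of the romanized katakana string."""
    return DEL_COST.get(ch, DEFAULT_COST)


def weighted_edit_distance(source: str, target: str) -> float:
    """Dynamic-programming edit distance with per-operation costs."""
    m, n = len(source), len(target)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = dist[i - 1][0] + deletion_cost(source[i - 1])
    for j in range(1, n + 1):
        dist[0][j] = dist[0][j - 1] + DEFAULT_COST
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            sub = 0.0 if a == b else SUB_COST.get((a, b), DEFAULT_COST)
            dist[i][j] = min(
                dist[i - 1][j] + deletion_cost(a),  # delete a
                dist[i][j - 1] + DEFAULT_COST,      # insert b
                dist[i - 1][j - 1] + sub,           # match or substitute
            )
    return dist[m][n]


def best_match(romanized: str, candidates: List[str]) -> str:
    """Return the candidate with the lowest weighted edit distance."""
    return min(candidates, key=lambda w: weighted_edit_distance(romanized, w))
```

For example, `best_match("beddo", ["bad", "bed", "bet"])` compares a romanized form of ベッド against three English candidates. In a bootstrapping setup, pairs scored as likely equivalents would be fed back to re-estimate the cost tables.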

Keywords:

Noisy channel model
Speech recognition
Katakana
Transliteration
Natural language processing
Computer science
Artificial intelligence
