Microblogs as Parallel Corpora

Wang Ling, Guang Xiang, Chris Dyer, Alan W. Black, Isabel Trancoso

2013

In the ever-expanding sea of microblog data, there is a surprising amount of naturally occurring parallel text: some users post multilingual messages targeting international audiences, while others “retweet” translations. We present an efficient method for detecting these messages and extracting parallel segments from them. We have been able to extract over 1M Chinese-English parallel segments from Sina Weibo (the Chinese counterpart of Twitter) using only their public APIs. As a supplement to existing parallel training data, our automatically extracted parallel data yields substantial translation quality improvements in translating microblog text and modest improvements in translating edited news commentary. The resources described in this paper are available at http://www.cs.cmu.edu/~lingwang/utopia.
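As a rough illustration of the detection step mentioned in the abstract, the sketch below flags posts that mix Chinese characters with English-looking words. It is a minimal sketch under stated assumptions, not the paper's method: the Unicode-range heuristic, the function name `is_candidate`, the thresholds, and the sample posts are all hypothetical and chosen only for illustration.

```python
import re

# Assumption: the basic CJK Unified Ideographs block is enough for a rough filter.
HAN = re.compile(r"[\u4e00-\u9fff]")
# Assumption: runs of two or more ASCII letters count as English-looking words.
LATIN_WORD = re.compile(r"[A-Za-z]{2,}")


def is_candidate(post: str, min_han: int = 2, min_words: int = 2) -> bool:
    """Return True if the post has both Chinese characters and English-looking words.

    This is a hypothetical pre-filter; it does not align or extract segments.
    """
    han_count = len(HAN.findall(post))
    word_count = len(LATIN_WORD.findall(post))
    return han_count >= min_han and word_count >= min_words


# Made-up sample posts, for illustration only.
posts = [
    "I love this song 我喜欢这首歌",
    "今天天气很好",
    "Good morning everyone",
]
candidates = [p for p in posts if is_candidate(p)]
```

A filter like this would only narrow down candidate messages; deciding which spans are actually translations of each other would still require a separate alignment step.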

Keywords:

World Wide Web
Natural language processing
Training set
Social media
Artificial intelligence
Computer science
Microblogging
Information retrieval
Parallel corpora
