Micro blogs Oriented Word Segmentation System

Liu Yijia, Zhang Meishan, Che Wanxiang, Liu Ting, Deng Yihe

2012

We present a Chinese word segmentation system submitted to the first task of the CLP 2012 bakeoff. Our segmenter is built on a conditional random field (CRF) sequence model. The training data combines a small set of annotated micro blogs with the People's Daily corpus. To improve the model's performance, we encode two kinds of information as features: special words detected by rules and information extracted from unlabeled data. We also derive a micro-blog-specific lexicon from automatically analyzed data and use lexicon-related features to assist the model. On the sample data for this task, these features yield a 1.8% improvement over the baseline model. Finally, our model achieves an F-score of 94.07% on the bakeoff's test set.
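
As a rough illustration of the approach described above, the sketch below shows how word segmentation can be cast as per-character sequence labeling, and how rule-detected special words and lexicon matches might be turned into features for a CRF. The abstract does not state the tag set, the rules, or the feature templates, so everything here is an assumption made for illustration: the B/M/E/S tag scheme, the `SPECIAL` pattern (URLs and @-mentions), the feature names, and the helper functions are all hypothetical. No CRF is trained; only the data preparation side is shown.

```python
import re

# Hypothetical rule for "special words" in micro blogs (an assumption,
# not the rule set used in the paper): URLs and @-mentions.
SPECIAL = re.compile(r"https?://[A-Za-z0-9./_-]+|@[A-Za-z0-9_]+")


def words_to_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")  # single-character word
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


def tags_to_words(chars, tags):
    """Rebuild words from characters and their B/M/E/S tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):  # a word ends at this character
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-word
        words.append(current)
    return words


def char_features(text, i, lexicon, max_len=4):
    """Illustrative feature dict for the character at position i."""
    feats = {
        "c0=" + text[i]: 1.0,
        "c-1=" + (text[i - 1] if i > 0 else "<s>"): 1.0,
        "c+1=" + (text[i + 1] if i + 1 < len(text) else "</s>"): 1.0,
    }
    # Rule feature: is this character inside a rule-detected special word?
    for m in SPECIAL.finditer(text):
        if m.start() <= i < m.end():
            feats["in_special"] = 1.0
            break
    # Lexicon feature: length of the longest lexicon word starting at i.
    for n in range(max_len, 1, -1):
        if i + n <= len(text) and text[i:i + n] in lexicon:
            feats["lex_begin_len=%d" % n] = 1.0
            break
    return feats
```

Feature dictionaries of this shape, one per character, paired with the tag sequence from `words_to_tags`, are the usual input to a linear-chain CRF toolkit; at prediction time, `tags_to_words` turns the predicted tags back into a segmentation.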

Keywords:

Lexicon
Training set
Social media
Speech recognition
Conditional random field
Test set
Microblogging
Text segmentation
Pattern recognition
Artificial intelligence
Computer science
ENCODE
Natural language processing
Sequence model
