Micro blogs Oriented Word Segmentation System

2012 
We present a Chinese word segmentation system submitted to the first task on CLP 2012 back-offs. Our segmenter is built using a conditional random field sequence model. We set the combination of a few annotated micro blogs and People Daily corpus as the training data. We encode special words detected by rules and information extracted from unlabeled data into features. These features are used to improve our model’s performance. We also derive a micro blog specified lexicon from auto-analyzed data and use lexicon related features to assist the model. When testing on the sample data of this task, these features result in 1.8% improvement over the baseline model. Finally, our model achieves F-score of 94.07% on the bakeoff’s test set.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    3
    References
    1
    Citations
    NaN
    KQI
    []