Specific Web Spider Design for the Extraction of Unknown Chinese Words from BBS Corpus

2009 
To address the low efficiency of unknown-word segmentation in Chinese, this paper presents an improved web spider design that extracts text from the TianYa BBS in order to construct a better corpus. Unknown words are then generated by extracting candidate words from the corpus with a new function that combines the Mutual Information function and the Duplicated Combination Frequency function. Experiments showed that the improved method was more efficient.
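The abstract does not give the combined function itself, so the following is only a minimal sketch of the general idea, not the paper's method. It assumes candidates are adjacent character pairs, uses standard pointwise mutual information as the association measure, and uses a raw repetition count as a stand-in for a combination-frequency signal. The name `pmi_scores` and the `min_count` threshold are hypothetical.

```python
import math
from collections import Counter


def pmi_scores(text, min_count=2):
    """Score adjacent character pairs by pointwise mutual information.

    Assumption: `text` is a plain string of corpus characters with
    whitespace and markup already removed. Returns a dict mapping each
    pair to (pmi, count). Pairs seen fewer than `min_count` times are
    skipped; the count is an illustrative frequency signal only.
    """
    unigrams = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for pair, count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        # PMI = log2( P(xy) / (P(x) * P(y)) )
        scores[pair] = (math.log2(p_xy / (p_x * p_y)), count)
    return scores


if __name__ == "__main__":
    # Toy input; a real corpus would come from the crawled BBS text.
    sample = "天涯论坛天涯社区天涯"
    for pair, (pmi, count) in pmi_scores(sample).items():
        print(pair, round(pmi, 3), count)
```

Pairs with both a high association score and a high repetition count would be the natural candidates for unknown words. How the two signals are weighted against each other is exactly what the paper's function defines and is not reproduced here.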