Chinese text segmentation for text retrieval: achievements and problems

Zimin Wu, Gwyneth Tseng

1993

Present text retrieval systems are generally built on the reductionist basis that words in texts (keywords) are used as indexing terms to represent the texts. A necessary precursor to these systems is word extraction which, for English texts, can be achieved automatically by using spaces and punctuation as word delimiters. This approach cannot be readily applied to Chinese texts because they have no obvious word boundaries. A Chinese text consists of a linear sequence of nonspaced or equally spaced ideographic characters, which are similar to morphemes in English. Researchers in Chinese text retrieval have therefore been seeking methods of text segmentation to divide Chinese texts automatically into words. First, a review of these methods is provided, in which the different approaches to Chinese text segmentation are classified to give a general picture of research activity in this area, and some of the most important work is described. There follows a discussion of the problems of Chinese text segmentation, illustrated with examples. These problems include morphological complexities, segmentation ambiguity, and parsing problems, and they demonstrate that text segmentation remains one of the most challenging and interesting areas for Chinese text retrieval. © 1993 John Wiley & Sons, Inc.
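
The contrast drawn in the abstract between English and Chinese word extraction can be illustrated with a minimal sketch. It is not taken from the paper: the function name `extract_words` and both sample sentences are assumptions chosen for illustration, and the regular expression is only one simple way to treat spaces and punctuation as delimiters.

```python
import re


def extract_words(text):
    """Return the runs of word characters found between spaces and punctuation.

    Hypothetical helper for illustration; not the authors' method.
    """
    # In Python 3, \w is Unicode-aware, so it matches Latin letters
    # and Chinese ideographs alike.
    return re.findall(r"\w+", text)


# Illustrative sample sentences (assumptions, not from the paper).
english = "Text retrieval systems use keywords as indexing terms."
chinese = "中文文本没有明显的词边界"  # "Chinese text has no obvious word boundaries"

english_tokens = extract_words(english)
chinese_tokens = extract_words(chinese)

print(english_tokens)  # eight separate English words
print(chinese_tokens)  # a single unsegmented string
```

Under these assumptions the English sentence splits into eight words, while the Chinese sentence comes back as one undivided token. Delimiter-based extraction gives no word boundaries for Chinese, which is why a separate segmentation step is needed.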

