Text Representation and Similarity Measure for Text Clustering Based on Semantic Strings: A Case Study on Uyghur Language

2021 
ABSTRACT In the Uyghur language, words segmented by inter-word spaces as natural separators can hardly serve as features in text representation, which leads to low efficiency in text processing. How to use language units beyond word boundaries as features to represent texts and improve the efficiency of text processing remains an open research topic. This paper proposes a semantic string extraction approach, a method for extracting language units beyond word boundaries. It also proposes methods for text representation and similarity measurement, and verifies their effectiveness in Uyghur text clustering tasks. Specifically, a combination of string expansion and language rules is applied to identify the trusted frequent patterns (TFP) in the text set. Next, semantic strings are evaluated and selected from the text set. For the similarity measure, each text is represented as a weighted semantic string set, and a set-based text similarity measuring approach is presented. Finally, these ideas and approaches are applied to Uyghur text clustering, and the corresponding clustering algorithms are proposed and verified through a series of experiments on a large-scale text corpus. Experimental results show that the semantic string-based text representation is in general very useful in processing the Uyghur language.
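
To make the idea of a set-based similarity over weighted semantic string sets more concrete, the following is a minimal sketch only. The abstract does not give the paper's actual formula, so this example substitutes a weighted Jaccard measure as an assumed stand-in. The function name `weighted_set_similarity`, the dictionary representation, and the sample strings and weights are all hypothetical.

```python
from typing import Dict


def weighted_set_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Weighted Jaccard similarity between two weighted string sets.

    ASSUMPTION: this is a generic stand-in measure, not the measure
    defined in the paper. Each text is a mapping from a semantic
    string to a non-negative weight.
    """
    keys = set(a) | set(b)
    # Shared weight: per-string minimum of the two weights.
    shared = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    # Total weight: per-string maximum of the two weights.
    total = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return shared / total if total > 0 else 0.0


# Hypothetical texts, each represented as a weighted semantic string set.
text_a = {"string_x": 1.0, "string_y": 2.0}
text_b = {"string_y": 2.0, "string_z": 1.0}

similarity = weighted_set_similarity(text_a, text_b)
print(similarity)
```

A measure of this kind is 1.0 for identical weighted sets and 0.0 for sets with no strings in common, so it can feed a distance-based clustering algorithm directly as `1 - similarity`.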