Concept-based Vector Space Model for Improving Text Clustering

2012 
In document clustering, documents within the same cluster should be highly similar, while documents in different clusters should be dissimilar. The cosine function measures the similarity of two documents. When the clusters are not well separated, partitioning them based only on pairwise similarity is not good enough, because some documents in different clusters may be similar to each other and the cosine function alone is not effective. To solve this problem, a similarity measure based on the concepts of neighbors and links is used. In the Vector Space Model (VSM), each document is represented by a vector composed of features and their weights. However, TF-IDF weighting has the drawback that exceptionally useful features may be discarded, and the simple VSM cannot represent semantics well because all columns (terms) are treated as independent. Indeed, the VSM ignores the important semantic and conceptual relations between words. We compensate for this by adding the counts of words that occur in important positions, and by embedding semantic relationship information directly in the weights of semantically related words, readjusting the weight values through similarity measures. In this way, similarity is used to re-weight term frequency in the VSM. Two clustering algorithms, bisecting k-means and feature-weighting k-means, were used to cluster real-world text data represented in the new knowledge-based VSM. The experimental results show that clustering performance with the new model was much better than with the traditional term-based VSM.
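The re-weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact formula, so the code below assumes one simple form: each term's weight is replaced by the similarity-weighted sum of the weights of all terms. The vocabulary, the `TERM_SIM` matrix, and the helper names `cosine` and `reweight` are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def reweight(weights, term_sim):
    """Assumed re-weighting: new_w[i] = sum_j term_sim[i][j] * w[j]."""
    n = len(weights)
    return [
        sum(term_sim[i][j] * weights[j] for j in range(n))
        for i in range(n)
    ]


# Hypothetical vocabulary: ["car", "automobile", "banana"].
# Illustrative term-term similarities, not taken from the paper.
TERM_SIM = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

doc_a = [2.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 3.0, 0.0]  # mentions only "automobile"

# Term-based VSM: no shared terms, so the documents look unrelated.
plain = cosine(doc_a, doc_b)

# Concept-based VSM: related terms share weight before comparison.
concept = cosine(reweight(doc_a, TERM_SIM), reweight(doc_b, TERM_SIM))

print(f"term-based cosine:    {plain:.4f}")
print(f"concept-based cosine: {concept:.4f}")
```

In this toy case the two documents share no terms, so the term-based cosine is zero, while the re-weighted vectors overlap through the assumed similarity between "car" and "automobile". A clustering algorithm such as bisecting k-means would then operate on the re-weighted vectors instead of the raw ones.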