A similarity assessment technique for effective grouping of documents

2015 
Display Omitted Document clustering refers to the task of grouping similar documents and segregating dissimilar documents. It is very useful to find meaningful categories from a large corpus. In practice, the task to categorize a corpus is not so easy, since it generally contains huge documents and the document vectors are high dimensional. This paper introduces a hybrid document clustering technique by combining a new hierarchical and the traditional k-means clustering techniques. A distance function is proposed to find the distance between the hierarchical clusters. Initially the algorithm constructs some clusters by the hierarchical clustering technique using the new distance function. Then k-means algorithm is performed by using the centroids of the hierarchical clusters to group the documents that are not included in the hierarchical clusters. The major advantage of the proposed distance function is that it is able to find the nature of the corpora by varying a similarity threshold. Thus the proposed clustering technique does not require the number of clusters prior to executing the algorithm. In this way the initial random selection of k centroids for k-means algorithm is not needed for the proposed method. The experimental evaluation using Reuter, Ohsumed and various TREC data sets shows that the proposed method performs significantly better than several other document clustering techniques. F-measure and normalized mutual information are used to show that the proposed method is effectively grouping the text data sets.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    36
    References
    21
    Citations
    NaN
    KQI
    []