Leveraging network structure for incremental document clustering

2012 
Recent studies have shown that link-based clustering methods can significantly improve the performance of content-based clustering. However, most previous algorithms are developed for fixed data sets, and are not applicable to the dynamic environments such as data warehouse and online digital library. In this paper, we introduce a novel approach which leverages the network structure for incremental clustering. Under this framework, both the link and content information are incorporated to determine the host cluster of a new document. The combination of two types of information ensures a promising performance of the clustering results. Furthermore, the status of core members is used to quickly determine whether to split or merge a new cluster. This filtering process eliminates the unnecessary and time-consuming checks of textual similarity on the whole corpus, and thus greatly speeds up the entire procedure. We evaluate our proposed approach on several real-world publication data sets and conduct an extensive comparison with both the classic content based and the recent link based algorithms. The experimental results demonstrate the effectiveness and efficiency of our method.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    20
    References
    0
    Citations
    NaN
    KQI
    []