Improving Text Document Clustering by Exploiting Open Web Directory.

2009 
The process of term extraction and weighting affects the performance of information retrieval, search engines and text mining systems. A text document is abstracted as a vector of terms, and the weight of each term is usually assigned using the popular TF-IDF method. In the TF-IDF method, the weight of a term is a function of its frequency in the document and in the overall document collection. Similarity computed with the cosine similarity method is determined by the terms two document vectors have in common (and their weights), and it ignores the semantic relations between terms. We can use the generalization property of hierarchical knowledge repositories to establish that terms correspond to specific instances of some generalized term. These generalized terms can be used to enrich the document vector; by enriching and weighting, we intend to obtain better similarity values between two documents. In this paper, we propose an improved term extraction and weighting method that exploits the contextual/semantic relationships between terms using knowledge repositories such as open web directories. The experimental results show that the proposed approach improves clustering performance over other term extraction and weighting approaches.
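The idea in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the `PARENT` mapping is a hypothetical stand-in for an open web directory, the 0.5 weight for generalized terms is an arbitrary assumption, and the smoothed IDF formula is one common variant chosen for illustration. It shows only that two documents with no terms in common obtain a non-zero cosine similarity once a shared generalized term is added to both vectors.

```python
import math
from collections import Counter

# Hypothetical term -> generalized term mapping (assumption, stands in for
# a lookup in an open web directory hierarchy).
PARENT = {
    "python": "programming",
    "java": "programming",
    "guitar": "music",
}

def enrich(tokens, parent=PARENT, weight=0.5):
    """Term counts plus generalized terms added at a reduced (assumed) weight."""
    counts = Counter(tokens)
    enriched = {t: float(c) for t, c in counts.items()}
    for term, c in counts.items():
        general = parent.get(term)
        if general is not None:
            enriched[general] = enriched.get(general, 0.0) + weight * c
    return enriched

def tfidf(docs):
    """docs: list of {term: count}. Weight = tf * smoothed idf."""
    n = len(docs)
    df = Counter(term for d in docs for term in d)
    return [
        {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in d.items()}
        for d in docs
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [["python", "code"], ["java", "compiler"], ["guitar", "chords"]]

plain = tfidf([{t: float(c) for t, c in Counter(d).items()} for d in docs])
rich = tfidf([enrich(d) for d in docs])

plain_sim = cosine(plain[0], plain[1])   # no shared terms
rich_sim = cosine(rich[0], rich[1])      # both now contain "programming"
unrelated_sim = cosine(rich[0], rich[2]) # "programming" vs "music"
```

Here the first two documents share no surface terms, so their plain TF-IDF similarity is zero, while the enriched vectors overlap on the generalized term. The third document maps to a different generalized term and stays dissimilar. How the generalized terms should actually be weighted is the substance of the paper and is not reproduced here.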