An Experimental Study on Feature Selection Using Wikipedia for Text Categorization

2012 
ABSTRACT In text categorization, core terms of an input document are unlikely to be selected as classification features if they do not occur in the training document set. In addition, synonymous terms that express the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by using Wikipedia to replace input terms that are absent from the training document set with the most similar term occurring in the training documents. For the selection of classification features, experiments were performed in various settings composed of three conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measure. The categorization performance of a kNN classifier improved by 0.35~1.85% in F1 value in all experimental settings when each non-training term was replaced by the training term with the highest similarity above the threshold value. Although the improvement is not as large as expected, several semantic as well as structural devices of Wikipedia could be used for selecting more effective classification features.
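The replacement step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `replace_unseen_terms`, the toy `links` table standing in for Wikipedia link data, the Jaccard similarity, the 0.5 threshold, and the choice to drop terms with no sufficiently similar match are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical stand-in for Wikipedia data: each term maps to the set of
# article links associated with it. Real link sets would come from Wikipedia.
links = {
    "automobile": {"Vehicle", "Engine", "Road"},
    "car": {"Vehicle", "Engine", "Road", "Wheel"},
    "banana": {"Fruit", "Plant"},
    "fruit": {"Fruit", "Plant", "Food"},
}


def jaccard(a: str, b: str) -> float:
    """Jaccard overlap of the two terms' link sets (one possible similarity measure)."""
    la, lb = links.get(a, set()), links.get(b, set())
    union = la | lb
    return len(la & lb) / len(union) if union else 0.0


def replace_unseen_terms(
    doc_terms: Iterable[str],
    training_vocab: Set[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> List[str]:
    """Replace each non-training term with its most similar training term."""
    candidates = sorted(training_vocab)  # sorted for deterministic tie-breaking
    features: List[str] = []
    for term in doc_terms:
        if term in training_vocab:
            features.append(term)  # training terms are kept unchanged
            continue
        if not candidates:
            continue
        best = max(candidates, key=lambda t: similarity(term, t))
        # Assumption: terms with no match above the threshold are dropped.
        if similarity(term, best) > threshold:
            features.append(best)
    return features


training_vocab = {"car", "fruit"}
document = ["automobile", "car", "banana", "zebra"]
features = replace_unseen_terms(document, training_vocab, jaccard, threshold=0.5)
print(features)  # ['car', 'car', 'fruit']
```

In this toy run, "automobile" and "banana" are mapped to the training terms "car" and "fruit", while "zebra" has no training term above the threshold and contributes no feature. The resulting feature list could then be passed to a classifier such as kNN.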