A study of local and global thresholding techniques in text categorization

2006 
Feature filtering is a widely used approach to dimensionality reduction in text categorization. In this approach, feature scoring methods evaluate each feature, and thresholding is then applied, either locally or globally, to select the highest-scoring features. In this paper, we investigate several local and global feature selection methods and suggest using Standard Deviation (STD) and Maximum Deviation (MD) as globalization schemes. We present a comparative study of fourteen thresholding techniques using different scoring methods and benchmark datasets of diverse nature, including an investigation of normalizing feature scores before combining them in the global pool. The results suggest that normalized MD outperforms the other methods in thresholding Document Frequency (DF) scores on even and moderately diverse datasets. Furthermore, the results indicate that normalizing feature scores improves performance on rare categories and balances the bias of some techniques toward frequent categories.
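The difference between local and global thresholding can be illustrated with a minimal Python sketch. It is not the paper's implementation: the toy scores are made up, the function names are hypothetical, and the abstract does not define its formulas, so three definitions here are assumptions. STD is taken as the population standard deviation of a feature's scores across categories, MD as the largest absolute deviation from that feature's mean score, and normalization as division by each category's maximum score.

```python
from statistics import mean, pstdev

# Hypothetical per-category feature scores (for example, document frequency).
# Each feature maps to one score per category; the numbers are invented.
scores = {
    "ball":   [9.0, 1.0, 0.0],
    "vote":   [1.0, 8.0, 1.0],
    "the":    [5.0, 5.0, 2.0],
    "rocket": [0.0, 0.0, 3.0],
}


def local_threshold(scores, k):
    """Local thresholding: keep the k best features of each category."""
    n_categories = len(next(iter(scores.values())))
    selected = set()
    for c in range(n_categories):
        ranked = sorted(scores, key=lambda f: scores[f][c], reverse=True)
        selected.update(ranked[:k])
    return selected


def normalize(scores):
    """Assumed normalization: divide each category's scores by its maximum."""
    n_categories = len(next(iter(scores.values())))
    col_max = [max(row[c] for row in scores.values())
               for c in range(n_categories)]
    return {f: [v / m if m else 0.0 for v, m in zip(row, col_max)]
            for f, row in scores.items()}


def std_global(values):
    """Assumed STD scheme: spread of a feature's scores across categories."""
    return pstdev(values)


def md_global(values):
    """Assumed MD scheme: largest absolute deviation from the mean score."""
    m = mean(values)
    return max(abs(v - m) for v in values)


def global_threshold(scores, k, combine):
    """Global thresholding: one combined score per feature, keep the top k."""
    ranked = sorted(scores, key=lambda f: combine(scores[f]), reverse=True)
    return ranked[:k]


print(local_threshold(scores, 1))
print(global_threshold(scores, 2, std_global))
print(global_threshold(normalize(scores), 2, std_global))
print(global_threshold(normalize(scores), 2, md_global))
```

On this toy data, local thresholding keeps one feature per category, while the unnormalized global pool keeps only features from the two higher-scoring categories. After the assumed normalization, the feature from the low-scoring third category enters the global selection. This only illustrates the mechanism on invented numbers and says nothing about the paper's experimental results.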