Topic-based Classification through Unigram Unmasking

2018 
Abstract: Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component of many text applications, such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction, a concept we call unigram unmasking. Previous text classification approaches have typically relied on a “bag-of-words” vector. We posit that some of the most frequent unigrams, which carry the greatest weight within these vectors, are not only unnecessary for classification but can even hurt a model’s accuracy. We present an approach in which a percentage of the most common unigrams is intentionally removed, thus “unmasking” the added value of less frequent unigrams. We report results from a topic-based classification task (hundreds of free online textbooks belonging to five domains: Career and Study Advice, Economics and Finance, IT Programming, Natural Sciences, and Statistics and Mathematics) and show that unmasking was helpful across several machine learning models, with some models benefiting even from the removal of nearly 50% of the most frequent unigrams from the bag-of-words vectors.
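The idea of removing a share of the most frequent unigrams before building bag-of-words vectors can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the percentage is measured, so the sketch assumes it is a fraction of the distinct unigrams ranked by total corpus frequency. The function names `unmask_vocabulary` and `bag_of_words`, the parameter `drop_fraction`, and the toy documents are all hypothetical.

```python
from collections import Counter


def unmask_vocabulary(tokenized_docs, drop_fraction):
    """Return the vocabulary left after dropping the most frequent unigrams.

    Assumption (not stated in the abstract): drop_fraction is applied to the
    number of distinct unigrams, ranked by total frequency in the corpus.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Rank by descending frequency; break ties alphabetically so the
    # result is deterministic.
    ranked = sorted(counts, key=lambda tok: (-counts[tok], tok))
    n_drop = int(len(ranked) * drop_fraction)
    return sorted(ranked[n_drop:])


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary unigram occurs in one document."""
    counts = Counter(doc)
    return [counts[tok] for tok in vocabulary]


# Toy corpus, invented purely for illustration.
docs = [
    "the market and the price of the bond".split(),
    "the loop and the array in the code".split(),
]
vocab = unmask_vocabulary(docs, drop_fraction=0.25)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

In this toy corpus the two most frequent unigrams ("the" and "and") are dropped, so the remaining vector dimensions are all topic-bearing words. The resulting vectors could then be passed to any classifier; which models and which drop percentages work best is an empirical question that the paper addresses.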