Vocabulary completion through word cooccurrence analysis using unlabeled documents for text categorization

Automated text categorization consists of developing computer programs that autonomously assign texts to predefined categories on the basis of their content. Such applications rely on supervised learning, which requires training on manually labeled documents. During this phase, the system discovers links between relevant terms (the vocabulary) and the identified categories. However, constructing a training set is slow and expensive. This paper suggests a way to assist text classifiers in gathering the vocabulary when the number of examples is limited, in which case the success rate is suboptimal. It proposes analyzing word cooccurrence within a collection of unlabeled documents in order to augment the vocabulary used by the classifier. The representation of new documents to classify would benefit from this augmented vocabulary. The expected result is an improvement in the classifier's success rate despite its limited training set.
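The general idea of completing a vocabulary from unlabeled text can be illustrated with a minimal sketch. It is not the paper's actual procedure, which the abstract does not specify. It assumes document-level cooccurrence counts, a simple frequency threshold, and whitespace tokenization; the function names, the `min_count` parameter, and the tiny stopword list are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = {"a", "the", "by"}

def tokenize(text):
    """Lowercase, split on whitespace, keep alphabetic non-stopword tokens."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOPWORDS]

def cooccurrence_counts(docs):
    """For each unordered word pair, count the documents containing both words."""
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(tokenize(doc)))  # unique words, stable pair ordering
        pair_counts.update(combinations(words, 2))
    return pair_counts

def expand_vocabulary(seed_vocab, unlabeled_docs, min_count=2):
    """Add every word that cooccurs with a seed term in at least min_count documents."""
    seed = set(seed_vocab)
    expanded = set(seed)
    for (a, b), n in cooccurrence_counts(unlabeled_docs).items():
        if n < min_count:
            continue
        if a in seed:
            expanded.add(b)
        if b in seed:
            expanded.add(a)
    return expanded

# The seed vocabulary stands in for terms learned from a small labeled set.
unlabeled = [
    "the striker scored a goal",
    "a late goal by the striker",
    "the market closed lower",
]
print(sorted(expand_vocabulary({"goal"}, unlabeled)))
```

Here "striker" joins the vocabulary because it appears alongside the seed term "goal" in two unlabeled documents, while "scored" and "late" each cooccur with it only once and fall below the threshold. A practical system would likely replace raw counts with a normalized association measure and a proper tokenizer.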