logo
    Some Investigations on Machine Learning Techniques for Automated Text Categorization
    1
    Citation
    9
    Reference
    10
    Related Paper
    Citation Trend
    Abstract:
    The automated categorization (classification) of texts into predefined categories is one of the widely explored fields of research in text mining.Now-a-days, availability of digital data is very high, and to manage them in predefined categories has become a challenging task.Machine learning technique is an approach by which we can train automated classifier to classify the documents with minimum human assistance.This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighborhood and Support Vector Machine methods within machine learning paradigm for automated text categorization of given documents in predefined categories.
    Keywords:
    Text Categorization
    The paper studies feature selection in text categorization learning. It focuses on dimensionality reduction. Because high dimensionality feature sets are not all important and available in categorization learning. In the end some categorization methods and characteristics were introduced.
    Text Categorization
    Feature (linguistics)
    Citations (4)
    At present, there are no practical and mature automatic text categorization methods for patent data. Therefore, this paper made a research on several key techniques about text categorization, improved the non-dictionary segment and weight calculation, and then proposed a hierarchical categorization method and an automatic text categorization framework for patent data. The experiment testifies that the system has a good classification accuracy and efficiency.
    Text Categorization
    Citations (0)
    This paper analyses current text categorization technology,summary the most important problems which should be solved by text categorization;after compared with different categorization algorithms,and different text model,it provides some indications and advices for optimizing current text categorization technology.
    Text Categorization
    Citations (0)
    Term weighting has proven to be an effective way to improve the performance of text categorization. Very recently, with the development of user-interactive question answering or community question answering, there has emerged a need to accurately categorize questions into predefined categories. However, as a question is usually a piece of short text, can the existing term-weighting methods perform consistently in question categorization as they do in text categorization? The answer is not clear, since to the best of our knowledge, we have not seen any work related to this problem despite of its significance. In this study, we investigate the popular unsupervised and supervised term-weighting methods for question categorization. At the same time, we propose three new supervised term-weighting methods, namely, gf* icf, igf* gf* icf, and vrf. Comparisons of them with existing unsupervised and supervised term weighting methods are made through a series of experiments on question collections of Yahoo! Answers. The experimental results show that igf* gf* icf achieves the best performance among all term-weighting methods, while gf*icf and vrf are also competitive for question categorization. Meanwhile, tf* OR is proven to be the most significant one among existing methods. In addition, igf* gf* icf and vrf are also effective for long document categorization.
    Citations (80)
    This paper presents an approach to text categorization that i) uses no machine learning and ii) reacts on-the-fly to unknown words. These features are important for categorizing Blog articles, which are updated on a daily basis and filled with newly coined words. We categorize 600 Blog articles into 12 domains. As a result, our categorization method achieved an accuracy of 94.0% (564/600).
    Text Categorization
    Basis (linear algebra)
    Citations (8)
    The k-nearest neighbors(k-NN) categorization method is simple and effective in text categorization. The uncertainty of training documents and classes border would appear in multi-class categorization, because of the overlapping of classes and the lack of features. But the conventional k-NN method is unsuitable to deal with this uncertainty. To this problem, a k-NN text categorization method based on the transferable belief model(TBM) is presented in the paper, It's convenient to make decision about the true class membership of a text to be classified through the application of the pignistic transformation. The experiment shows the method improve the precision and recall of text categorization.
    Text Categorization
    Citations (0)
    Testing of the text categorization and comparison testing is carried out based on small-scaled dataset. In case of lack of trained set, without training, the indexed text keywords are used to categorize the expert subject terms, with large categorization accuracy amounted to 0.82. In case of less trained set, after training, the characteristics vectors acquired from the training are added into experts’ subject terms and are categorized, with large accuracy amounted to 0.94, the level-3 accuracy amounted to 0.73, so the results are satisfying.
    Realization (probability)
    Text Categorization
    Training set
    This paper analyzes the factors which influence the CHI categorization accuracy and removes the negative correlation between the items and the category.The improved approach is applied to weight adjustment,obviously improving categorization quality.Furthermore,concentration information,distribution information and frequency information are introduced into the improved approach,which increases the categorization accuracy on the corpus of category uneven distribution.The experimental results verify the efficiency and probability of the improved CHI approach.
    Text Categorization
    Feature (linguistics)
    Citations (10)
    Text categorization-assignment of natural language texts to one or more predefined categories based on their content-is an important component in many information organization and management tasks. Categorization algorithm is the most critical factor to text categorization system performance. The inductive learning classifiers are put forward. Very accurate text categorization result can be learned automatically from training examples.
    Text Categorization
    Component (thermodynamics)
    Factor (programming language)
    Inductive bias
    Inductive Reasoning