Some Investigations on Machine Learning Techniques for Automated Text Categorization

International Journal of Computer Applications (2013)

Bhagirath Prajapati Sanjay Garg Nihal Chauhan

Abstract:

The automated categorization (classification) of texts into predefined categories is one of the most widely explored research areas in text mining. The volume of available digital data is now very high, and organizing it into predefined categories has become a challenging task. Machine learning makes it possible to train an automated classifier that categorizes documents with minimal human assistance. This paper discusses the Naïve Bayes, Rocchio, k-Nearest Neighbor, and Support Vector Machine methods within the machine learning paradigm for automated categorization of documents into predefined categories.
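As an illustration of one of the four methods named in the abstract, the following is a minimal sketch of a Rocchio-style (centroid) classifier. It is not taken from the paper: it assumes whitespace tokenization, raw term counts as weights, and a simplified variant that averages only the positive examples of each category. The names `train_centroids`, `classify`, and `TRAIN` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tf_vector(text):
    """Map lowercased whitespace tokens to raw term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_docs):
    """Average the term vectors of each category into one centroid."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(tf_vector(text))
        counts[label] += 1
    return {
        label: {t: w / counts[label] for t, w in vec.items()}
        for label, vec in sums.items()
    }


def classify(text, centroids):
    """Assign the category whose centroid is most similar to the document."""
    vec = tf_vector(text)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy training data, invented for illustration only.
TRAIN = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]

centroids = train_centroids(TRAIN)
prediction = classify("interest rates and the stock market", centroids)
print(prediction)
```

The full Rocchio formulation also subtracts a weighted centroid of negative examples; that term is omitted here to keep the sketch short.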

Keywords:

Text Categorization

Topics:

Text and Document Classification Technologies

Advanced Text Analysis Techniques

Spam and Phishing Detection

10.5120/12340-8617

Feature Selection in Text Categorization

Jisuanji gongcheng (2004)

Hantao Song

The paper studies feature selection for text categorization, focusing on dimensionality reduction, because not all features in a high-dimensional feature set are important or useful for categorization learning. It closes by introducing several categorization methods and their characteristics.

Text Categorization

Feature (linguistics)

Citations (4)

Automatic text categorization for patent data

Journal of Computer Applications (2008)

Sun Zhi-hui

At present, there are no practical, mature automatic text categorization methods for patent data. This paper therefore investigates several key text categorization techniques, improves non-dictionary segmentation and weight calculation, and proposes a hierarchical categorization method and an automatic text categorization framework for patent data. Experiments show that the system achieves good classification accuracy and efficiency.

Text Categorization

Citations (0)

Modern Text Categorization Technology Analyse

Journal of the Chinese People's Armed Police Force Academy (2007)

Zhou Wenxia

This paper analyzes current text categorization technology and summarizes the most important problems that text categorization must solve. After comparing different categorization algorithms and text models, it offers guidance and advice for optimizing current text categorization technology.

Text Categorization

Citations (0)

Term Weighting Schemes for Question Categorization

IEEE Transactions on Pattern Analysis and Machine Intelligence (2010)

Xiaojun Quan Liu Wenyin Bite Qiu

Term weighting has proven to be an effective way to improve the performance of text categorization. Very recently, with the development of user-interactive question answering or community question answering, there has emerged a need to accurately categorize questions into predefined categories. However, as a question is usually a piece of short text, can the existing term-weighting methods perform consistently in question categorization as they do in text categorization? The answer is not clear, since to the best of our knowledge, we have not seen any work related to this problem despite its significance. In this study, we investigate the popular unsupervised and supervised term-weighting methods for question categorization. At the same time, we propose three new supervised term-weighting methods, namely, gf*icf, igf*gf*icf, and vrf. Comparisons of them with existing unsupervised and supervised term-weighting methods are made through a series of experiments on question collections of Yahoo! Answers. The experimental results show that igf*gf*icf achieves the best performance among all term-weighting methods, while gf*icf and vrf are also competitive for question categorization. Meanwhile, tf*OR is proven to be the most significant one among existing methods. In addition, igf*gf*icf and vrf are also effective for long document categorization.

10.1109/tpami.2010.154

Citations (80)

Blog categorization exploiting domain dictionary and dynamically estimated domains of unknown words

Chikara Hashimoto Sadao Kurohashi

This paper presents an approach to text categorization that i) uses no machine learning and ii) reacts on-the-fly to unknown words. These features are important for categorizing Blog articles, which are updated on a daily basis and filled with newly coined words. We categorize 600 Blog articles into 12 domains. As a result, our categorization method achieved an accuracy of 94.0% (564/600).

Text Categorization

Basis (linear algebra)

10.3115/1557690.1557709

Citations (8)

k-NN Text Categorization Method Based on Transferable Belief Model

International Conference on Internet Technology and Applications (2011)

Xuefeng Fu Liu Qiu-yun

The k-nearest neighbors (k-NN) categorization method is simple and effective for text categorization. In multi-class categorization, overlapping classes and a lack of features introduce uncertainty about training documents and class borders, and the conventional k-NN method is ill-suited to handling this uncertainty. To address the problem, the paper presents a k-NN text categorization method based on the transferable belief model (TBM). Applying the pignistic transformation makes it convenient to decide the true class membership of a text to be classified. Experiments show that the method improves the precision and recall of text categorization.
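For reference, the conventional k-NN baseline that the abstract contrasts against can be sketched as below. This is only the plain majority-vote method, not the TBM-based method of the paper, whose details the abstract does not give. Whitespace tokenization, raw term counts, cosine similarity, and the names `knn_classify` and `DOCS` are assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map lowercased whitespace tokens to raw term counts (an assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, labeled_docs, k=3):
    """Label a text by majority vote among its k most similar training docs."""
    query = bag_of_words(text)
    scored = sorted(
        ((cosine(query, bag_of_words(doc)), label) for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Toy training data, invented for illustration only.
DOCS = [
    ("the team won the football match", "sports"),
    ("the striker scored a late goal", "sports"),
    ("the bank raised interest rates", "finance"),
    ("stock markets fell as rates rose", "finance"),
]

label = knn_classify("the goalkeeper saved a late goal", DOCS, k=3)
print(label)
```

A plain vote like this gives every neighbor equal say regardless of how ambiguous it is, which is the kind of class-border uncertainty the abstract says the conventional method handles poorly.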

Text Categorization

10.1109/itap.2011.6006174

Citations (0)

On the strength of hyperclique patterns for text categorization

Information Sciences (2007)

Tieyun Qian Hui Xiong Yuanzhen Wang Enhong Chen

Text Categorization

Association (psychology)

10.1016/j.ins.2007.04.005

Citations (17)

Realization of Text Categorization for Small-Scaled Dataset

Advanced materials research (2012)

Liu Hua

Text categorization tests and comparison tests are carried out on a small-scale dataset. When no training set is available, the indexed text keywords are used, without training, to categorize the expert subject terms, with large-category accuracy reaching 0.82. When only a small training set is available, the feature vectors acquired from training are added to the experts' subject terms before categorization, with large-category accuracy reaching 0.94 and level-3 accuracy reaching 0.73; the results are satisfactory.

Realization (probability)

Text Categorization

Training set

10.4028/www.scientific.net/amr.532-533.1239

Citations (0)

Study on Improved CHI for feature selection in Chinese text categorization

Computer Engineering and Applications Journal (2011)

Pei Yingbo Xiaoxia Liu

This paper analyzes the factors that influence CHI categorization accuracy and removes the negative correlation between terms and categories. The improved approach is applied to weight adjustment and clearly improves categorization quality. Furthermore, concentration, distribution, and frequency information are introduced into the improved approach, which increases categorization accuracy on corpora with unevenly distributed categories. The experimental results verify the efficiency and feasibility of the improved CHI approach.
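For context, the standard CHI (chi-square) statistic for a term and a category can be sketched as follows. The abstract does not state the improved formula, so only the textbook statistic is shown, together with a sign check that illustrates why negative correlation is an issue: the squared statistic scores a term that avoids a category as highly as one that favors it. The function names and the counts a, b, c, d are assumptions following the usual 2x2 contingency table.

```python
def chi_square(a, b, c, d):
    """Standard chi-square statistic for a term t and a category k.

    a: documents in k that contain t
    b: documents outside k that contain t
    c: documents in k that do not contain t
    d: documents outside k that do not contain t
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return n * (a * d - c * b) ** 2 / denom


def correlation_sign(a, b, c, d):
    """+1 if t is positively associated with k, -1 if negatively, 0 if neither."""
    diff = a * d - c * b
    return (diff > 0) - (diff < 0)


# Invented counts for illustration only.
positive = chi_square(40, 10, 10, 40)   # term concentrated inside the category
negative = chi_square(10, 40, 40, 10)   # term concentrated outside the category
print(positive, negative)
print(correlation_sign(40, 10, 10, 40), correlation_sign(10, 40, 40, 10))
```

Both tables give the same statistic even though the term points toward the category in one case and away from it in the other; filtering on the sign is one simple way to discard the negatively correlated case.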

Text Categorization

Feature (linguistics)

Citations (10)

Text categorization algorithms representations based on inductive learning

Jianfang Cao Hongbin Wang

Text categorization, the assignment of natural language texts to one or more predefined categories based on their content, is an important component of many information organization and management tasks. The categorization algorithm is the most critical factor in the performance of a text categorization system. Inductive learning classifiers are put forward. Very accurate text classifiers can be learned automatically from training examples.

Text Categorization

Component (thermodynamics)

Factor (programming language)

Inductive bias

Inductive Reasoning

10.1109/icime.2010.5477992

Citations (4)