Domain Identification of Urdu News Text

2019 
In recent years, the amount of data available online has grown exponentially, making automatic text classification of documents an increasingly important task. Many studies have addressed text classification for English, Chinese, Arabic, and other languages, but far fewer have addressed Urdu documents. This paper addresses Urdu text classification. For this purpose, we build a high-quality Urdu news dataset, COUNT19, with labeled categories. In addition, we evaluate the performance of state-of-the-art machine learning algorithms with and without the pre-processing steps of stopword removal and stemming. Unigram and bigram features weighted with Term Frequency-Inverse Document Frequency (TF-IDF) are used to represent the text. Our analysis shows that the Multi-Layered Perceptron (MLP) is the best-performing classifier, with 91.4% accuracy. The results also show that stemming does not improve classifier performance and that stopword removal degrades it. This study can help in selecting a classifier and the best-performing pre-processing techniques for building an Urdu text classification system, an essential tool for Information Retrieval (IR) and Natural Language Processing (NLP) tasks.
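The feature representation described in the abstract (unigram and bigram features weighted by TF-IDF, with optional stopword removal) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngrams` and `tfidf`, the whitespace tokenizer, and the smoothed IDF formula are assumptions chosen for illustration, and the abstract does not state which TF-IDF variant was used.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return unigrams and bigrams (n = 1..n_max) as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tfidf(docs, stopwords=frozenset()):
    """Compute one sparse TF-IDF vector (a dict) per document.

    docs: list of strings, tokenized here by whitespace (an assumption).
    stopwords: optional set of tokens dropped before n-gram extraction.
    """
    tokenized = [[t for t in d.split() if t not in stopwords] for d in docs]
    counts = [Counter(ngrams(toks)) for toks in tokenized]
    n_docs = len(docs)

    # Document frequency: number of documents containing each n-gram.
    df = Counter()
    for c in counts:
        df.update(c.keys())

    vectors = []
    for c in counts:
        total = sum(c.values())
        vec = {}
        for term, freq in c.items():
            # Smoothed IDF; one common variant, assumed for this sketch.
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1
            vec[term] = (freq / total) * idf
        vectors.append(vec)
    return vectors


# Placeholder tokens stand in for Urdu words.
vectors = tfidf(["a b c", "a b d"])
```

Passing a non-empty `stopwords` set makes it possible to compare features with and without stopword removal, which is the kind of comparison the abstract reports. Note that removing a stopword also changes which bigrams are formed.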