Arabic text classification: New study

2017 
Text classification performance depends heavily on the feature type, that is, the unit extracted from the text and presented to the classification algorithm. Character N-grams, word roots, word stems, and full words have all been used as features for Arabic text classification. A survey of the current literature found no prior study of the effect of root N-grams and stem N-grams (N consecutive roots or stems) on Arabic text classification performance. We therefore conducted 108 experiments using three N-gram sizes (1-grams, 2-grams, and 3-grams) of roots, stems, and full words. Chi-square was used for feature selection, with three thresholds on the number of features (100, 500, and 1000). Term frequency-inverse document frequency (TF-IDF) was used as the representation scheme. Three classifiers were evaluated: Naive Bayes, K-Nearest Neighbor, and Support Vector Machine. The results show that root 1-grams yield better classification performance than stem or word N-grams. They also show that classification performance decreases as N increases, and that the Support Vector Machine outperforms Naive Bayes and K-Nearest Neighbor with 1-grams. With the K-Nearest Neighbor classifier, however, root 2-grams achieved the best performance, while with the Support Vector Machine, root 3-grams achieved the best performance.
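The feature pipeline described in the abstract (term N-grams, chi-square feature selection, TF-IDF weighting) can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the transliterated toy "roots", the class labels, and the specific chi-square and TF-IDF variants (a 2x2 presence/class table and a raw-count tf with log(N/df) idf) are all assumptions made for illustration. Root extraction itself is assumed to have been done beforehand.

```python
import math
from collections import Counter


def term_ngrams(tokens, n):
    """Join every run of n consecutive terms (roots, stems, or words) with '_'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def chi_square(docs, labels, feature, target):
    """Chi-square score from the 2x2 table of feature presence vs. class membership."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = feature in doc
        if present and label == target:
            a += 1          # feature present, document in target class
        elif present:
            b += 1          # feature present, document in another class
        elif label == target:
            c += 1          # feature absent, document in target class
        else:
            d += 1          # feature absent, document in another class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs, labels, k):
    """Keep the k features whose best per-class chi-square score is highest."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(docs, labels, f, c) for c in classes) for f in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda f: (-score[f], f))[:k]


def tfidf_vector(doc, features, docs):
    """TF-IDF weights for one document: raw term count times log(N / df)."""
    counts = Counter(doc)
    n_docs = len(docs)
    vector = []
    for f in features:
        df = sum(1 for other in docs if f in other)
        vector.append(counts[f] * math.log(n_docs / df) if df else 0.0)
    return vector


# Hypothetical toy corpus: each document is a list of already-extracted roots
# (transliterated placeholders, not output of a real Arabic root extractor).
root_docs = [
    ["qwl", "ktb", "drs", "elm"],
    ["qwl", "drs", "elm", "ktb"],
    ["qwl", "leb", "krh", "fwz"],
    ["qwl", "krh", "fwz", "leb"],
]
labels = ["education", "education", "sport", "sport"]

unigram_docs = [term_ngrams(doc, 1) for doc in root_docs]
bigram_docs = [term_ngrams(doc, 2) for doc in root_docs]

selected = select_features(unigram_docs, labels, k=6)
vectors = [tfidf_vector(doc, selected, unigram_docs) for doc in unigram_docs]
```

In this toy corpus the root "qwl" occurs in every document, so its chi-square score is zero and it is dropped when the six highest-scoring features are kept. The resulting vectors are what would then be passed to a classifier such as Naive Bayes, K-Nearest Neighbor, or a Support Vector Machine; the study's thresholds of 100, 500, and 1000 features correspond to the `k` argument here.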