Towards POS Tagging Methods for Bengali Language: A Comparative Analysis

2021 
Part-of-speech (POS) tagging is a significant research problem in Natural Language Processing (NLP) and an important component of several NLP technologies. However, developing an efficient POS tagger is challenging for resource-scarce languages such as Bengali. This paper presents an empirical investigation of POS tagging techniques for the Bengali language. An extensively annotated corpus of around 7390 sentences is used to evaluate 16 POS tagging techniques: eight stochastic methods and eight transformation-based methods. The stochastic methods are unigram, bigram, trigram, unigram+bigram, unigram+bigram+trigram, Hidden Markov Model (HMM), Conditional Random Field (CRF), and Trigrams'n'Tags (TnT); the transformation-based methods combine the Brill tagger with each of these stochastic techniques. The methods are compared by accuracy on two tagsets (30-tag and 11-tag). Brill combined with CRF achieves the highest accuracy among all the techniques: 91.83% on the 11-tag tagset and 84.5% on the 30-tag tagset.
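To make the "unigram+bigram" combination concrete, the following is a minimal sketch of a bigram tagger that backs off to a unigram tagger, together with the token-level accuracy measure. It is not the paper's implementation: the function names, the `<s>` start marker, the `NOUN` default tag, and the tiny English toy corpus are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

START = "<s>"  # assumed sentence-start marker for the bigram context


def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def train_bigram(tagged_sents):
    """Map each (previous tag, word) pair to its most frequent tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}


def tag(words, unigram, bigram, default="NOUN"):
    """Try the bigram model first, then the unigram model, then a default."""
    tags = []
    prev = START
    for w in words:
        t = bigram.get((prev, w)) or unigram.get(w) or default
        tags.append(t)
        prev = t
    return tags


def accuracy(gold_sents, unigram, bigram):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        pred = tag([w for w, _ in sent], unigram, bigram)
        for (_, gold), p in zip(sent, pred):
            correct += int(gold == p)
            total += 1
    return correct / total if total else 0.0


# Hypothetical toy corpus; "run" is ambiguous between NOUN and VERB.
train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
]
uni = train_unigram(train)
bi = train_bigram(train)
```

In this toy data the unigram model alone cannot separate the two uses of "run", while the bigram context (the previous tag) can. A Brill-style tagger would go one step further and learn correction rules on top of the output of such an initial tagger.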