The development of a fine grained class set for Amazigh POS tagging

2013 
Like most languages that have only recently been investigated for Natural Language Processing (NLP) tasks, Amazigh still suffers from a scarcity of annotated corpora, linguistic tools, and resources. The main aim of this paper is to present a tokenizer and a new part-of-speech (POS) tagger based on a new Amazigh tag set (AMTS) composed of 28 tags. In line with this goal, we trained two sequence classification models, using Support Vector Machines (SVMs) and Conditional Random Fields (CRFs), to build a tokenizer and a POS tagger for the Amazigh language. We used 10-fold cross-validation to evaluate and validate our approach. POS tagging results using SVMs and CRFs are very comparable: CRFs outperformed SVMs at the fold level (91.18% vs. 90.75%) and on the 10-fold average (87.95% vs. 87.11%). For the tokenization task, SVMs outperformed CRFs at the fold level (99.97% vs. 99.85%) and on the 10-fold average (99.95% vs. 99.89%).
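The evaluation protocol described above, reporting both per-fold accuracy and the average over 10 folds, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a most-frequent-tag baseline stands in for the SVM and CRF models, and the function names (`k_fold_indices`, `train_baseline`, `cross_validate`) and the toy corpus are hypothetical.

```python
from collections import Counter, defaultdict


def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_baseline(sentences):
    """Most-frequent-tag baseline (a stand-in for the SVM and CRF taggers)."""
    per_word = defaultdict(Counter)
    all_tags = Counter()
    for sentence in sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            all_tags[tag] += 1
    default_tag = all_tags.most_common(1)[0][0]  # used for unseen words
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    return lexicon, default_tag


def accuracy(model, sentences):
    """Token-level tagging accuracy of the baseline on held-out sentences."""
    lexicon, default_tag = model
    correct = total = 0
    for sentence in sentences:
        for word, tag in sentence:
            correct += lexicon.get(word, default_tag) == tag
            total += 1
    return correct / total if total else 0.0


def cross_validate(sentences, k=10):
    """Return (per-fold accuracies, mean accuracy over the k folds)."""
    if not 2 <= k <= len(sentences):
        raise ValueError("k must satisfy 2 <= k <= number of sentences")
    scores = []
    for held_out in k_fold_indices(len(sentences), k):
        held = set(held_out)
        train = [s for i, s in enumerate(sentences) if i not in held]
        test = [sentences[i] for i in held_out]
        scores.append(accuracy(train_baseline(train), test))
    return scores, sum(scores) / len(scores)


# Toy corpus with placeholder tokens and tags (not real Amazigh data).
toy_corpus = [[("w1", "NOUN"), ("w2", "VERB")] for _ in range(20)]
fold_scores, mean_score = cross_validate(toy_corpus, k=10)
```

Here `fold_scores` holds one accuracy per fold and `mean_score` is their average, which mirrors the two levels at which the abstract reports results. Swapping in an actual SVM or CRF sequence model would only change `train_baseline` and the prediction step inside `accuracy`.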