Token Validation in Automatic Corpus Gathering for Yorùbá Language

2017 
Recent methodologies in machine translation depend on the availability of large language corpora. The web, being a repository for text and other multimedia content, is a viable source for such data. However, the harvested text must be cleaned as a pre-processing step, since foreign words are inevitably part of it. A dictionary lookup approach can be adopted for languages with a comprehensive lexicon, while manual cleaning is applied in other cases. Developing a full-coverage lexicon for the Yoruba language is cumbersome because new words can be formed through elision, assimilation and contraction. In this paper, the morphology of the Yoruba language was studied and modelled as a Finite State Machine (FSM) which accepts a word and returns true if the goal state is reached and false otherwise. The FSM model was implemented in Java. A Yoruba dictionary containing 10,443 distinct words in their base form (i.e. without diacritics) and an English dictionary with 64,150 distinct words were passed through the finite state machine. In addition, 58 web pages sourced from the internet were classified by the system. Classification of entries from the Yoruba dictionary as valid Yoruba words gave 99.99% accuracy, while classification of entries from the English dictionary as non-Yoruba words gave 94.07% accuracy. Also, using a threshold of 90% valid Yoruba words in a web page, all 58 web pages were correctly classified. The results indicate that the approach can reliably be applied in automatic harvesting of a Yoruba monolingual corpus from the internet.
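The page-level decision described above can be illustrated with a minimal Java sketch. This is not the authors' implementation: the abstract does not give the FSM's states or transitions, so the word validator is left as a caller-supplied `Predicate<String>`. The names `PageClassifier`, `tokenize`, `validFraction` and `isYorubaPage` are hypothetical, and both the tokenization rule and the use of "at least 90%" (rather than "more than 90%") are assumptions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical sketch: classifies a page by the share of tokens that a
// word validator accepts. The validator stands in for the paper's FSM,
// whose transitions are not specified in the abstract.
final class PageClassifier {

    // 90% threshold taken from the abstract; treating it as inclusive is an assumption.
    static final double THRESHOLD = 0.90;

    // Assumed tokenization: split on anything that is not a letter or a
    // combining mark, so that diacritics stay attached to their words.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.split("[^\\p{L}\\p{M}]+"))
                .filter(token -> !token.isEmpty())
                .collect(Collectors.toList());
    }

    // Fraction of tokens accepted by the validator; 0.0 for an empty page.
    static double validFraction(List<String> tokens, Predicate<String> isValidWord) {
        if (tokens.isEmpty()) {
            return 0.0;
        }
        long valid = tokens.stream().filter(isValidWord).count();
        return (double) valid / tokens.size();
    }

    // A page is kept when the accepted share reaches the threshold.
    static boolean isYorubaPage(String text, Predicate<String> isValidWord) {
        return validFraction(tokenize(text), isValidWord) >= THRESHOLD;
    }
}
```

Keeping the validator behind a `Predicate<String>` means an FSM-based acceptor, or a dictionary lookup for languages with a comprehensive lexicon, could be supplied without changing the page-level logic.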