Tamil Stopword Removal Based on Term Frequency

2020 
As the volume of text data in digital form grows exponentially, managing and retrieving these documents becomes difficult. A number of natural language processing (NLP) tasks, viz. archival, retrieval, query response, information summarization, etc., rely heavily on automatic classification of text documents. This has led researchers to apply machine learning to categorize documents automatically by language and, within documents of the same language, to devise methods that segregate them according to their contents. At present, preprocessing of text alone accounts for more than 70% of the total text classification process [1]. This indicates the importance of preprocessing: the efficiency of text classification depends on an efficient preprocessing step. This article deals with corpus creation for Tamil documents and Tamil language stopword removal. Dictionary-based and frequency-based stopword removal methods are proposed in this work.
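The two approaches named above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the four-word `SAMPLE_STOPWORDS` set, the function names, and the `top_k` cutoff are all assumptions made for illustration. The dictionary-based variant filters tokens against a fixed list, while the frequency-based variant treats the most frequent terms in the corpus as stopwords.

```python
from collections import Counter

# Illustrative hand-picked list (an assumption, not the paper's dictionary).
SAMPLE_STOPWORDS = {"ஒரு", "மற்றும்", "இந்த", "என்று"}


def tokenize(text):
    """Naive whitespace tokenization; real Tamil text may need more care."""
    return text.split()


def remove_by_dictionary(tokens, stopwords=SAMPLE_STOPWORDS):
    """Dictionary-based removal: drop tokens found in a fixed stopword list."""
    return [t for t in tokens if t not in stopwords]


def frequent_terms(documents, top_k):
    """Return the top_k most frequent terms across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return {term for term, _ in counts.most_common(top_k)}


def remove_by_frequency(documents, top_k):
    """Frequency-based removal: treat the top_k corpus terms as stopwords."""
    stopwords = frequent_terms(documents, top_k)
    return [[t for t in tokenize(doc) if t not in stopwords] for doc in documents]
```

In practice the cutoff (here the hypothetical `top_k`) would have to be tuned against the corpus, since a purely frequency-based list can also capture frequent content words.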