BARD: Bangla Article Classification Using a New Comprehensive Dataset

2018 
Automated Bangla article classification has been studied in the literature, and several supervised learning models have been proposed that rely on a large textual corpus. Although several comprehensive textual datasets are available for other languages, only a few small datasets have been curated for Bangla. As a result, few works address the Bangla document classification problem, and, lacking sufficient training data, these approaches have not been able to learn sophisticated supervised models. In this work, we curate a large dataset of Bangla articles from different news portals, containing around 376,226 articles. This large, diverse dataset allows us to train several supervised learning models using a set of sophisticated textual features, such as word embeddings and TF-IDF. Our learning model shows promising performance on the curated dataset compared to state-of-the-art work in Bangla article classification. Furthermore, we deployed the proposed Bangla content classifier as a web application at bard2018.pythonanywhere.com; a video demo of the application is available at bit.ly/BARD_VIDEO_DEMO. Additionally, we open-sourced the BARD dataset (bit.ly/BARD_DATASET) and the source code of this work (bit.ly/BARD_SC).