Toxic Speech Detection using Traditional Machine Learning Models and BERT and fastText Embedding with Deep Neural Networks

2021 
The introduction of social media brought about a revolution in digitalization and communication. These platforms were initially developed to connect people across global boundaries while allowing them to express their views and opinions and learn from others’ ideas. With the onset of the pandemic, usage of these sites has risen significantly among businesses, educational institutions, students, and the general public. The growing ubiquity of social media platforms like Twitter and Facebook has long been an issue of major concern. Along with enabling enhanced communication, these platforms allow internet users to voice opinions that circulate among the masses within seconds. Moreover, given the different backgrounds, beliefs, ethnicities, and cultures of the users on these platforms, many of them tend to use mean, aggressive, and hateful language in discussions with people whose backgrounds differ from their own. The amount of hate speech and offensive content has been rising sharply. Terms like "profane", "hate", and "offensive" are often used interchangeably, so we group them under the broader category of "toxic" content. A major part of our dataset focuses on conversations prevalent among the youth. After preprocessing this dataset with NLP techniques and embeddings (BERT and fastText), we applied several machine learning models (LR, SVM, DT, RF, XGBoost) and deep learning models (CNN, MLP, LSTM), with CNN giving the best results.