Language Identification of Hindi-English tweets using code-mixed BERT

Mohd. Zeeshan Ansari, M. M. Sufyan Beg, Tanvir Ahmad, Mohd Jazib Khan, Ghazali Wasim

2021

Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly code-mixed in non-English-speaking states. Prior knowledge acquired by pre-training contextual embeddings has shown state-of-the-art results for a range of downstream tasks. Recently, models such as BERT have shown that, given a large amount of unlabeled data, pretrained language models are even more beneficial for learning common language representations. This paper presents extensive experiments that exploit transfer learning and fine-tune BERT models to identify language on Twitter. The work uses a data collection of Hindi-English-Urdu code-mixed text for language pre-training and Hindi-English code-mixed text for subsequent word-level language classification. The results show that representations pre-trained on code-mixed data produce better results than their monolingual counterparts.
