Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach

Alejandro Mosquera, Paloma Moreda Pozo

2014

User-generated content has become a recurrent resource for NLP tools and applications, and considerable effort has recently gone into handling the noise present in short social media texts. Normalisation techniques have proven useful for identifying and replacing lexical variants in some of the most informal genres, such as microblogs. However, training and evaluating these systems requires annotated data, which is usually costly to produce. Until now, most of these approaches have focused on English and have not taken into account demographic variables such as user location and gender. In this paper we describe the methodology used for automatically mining a corpus of variant and normalisation pairs from English and Spanish tweets.

Keywords:

Social media
Natural language processing
Data mining
Microblogging
Engineering
Artificial intelligence
