TweetMT: a parallel microblog corpus

Iñaki San Vicente, Iñaki Alegria, Cristina España-Bonet, Pablo Gamallo, Hugo Gonçalo Oliveira, Eva Martínez Garcia, Antonio Toral, Arkaitz Zubiaga, Nora Aranberri

2016

We introduce TweetMT, a parallel corpus of tweets in four language pairs that combine five languages (Spanish from/to Basque, Catalan, Galician and Portuguese), all of which have official status in the Iberian Peninsula. The corpus was created by combining automatic collection with crowdsourcing, and it is publicly available. It is intended for the development and testing of microtext machine translation systems. In this paper, we describe the methodology followed to build the corpus and present the results of the shared task in which it was tested.

Keywords:

Speech recognition
Text corpus
Artificial intelligence
Computer science
Social media
Machine translation
Corpus linguistics
Natural language processing
Catalan
Microblogging
Portuguese
Crowdsourcing
Linguistics
