ParaCrawl Corpus version 1.0

Philipp Koehn, Kenneth Heafield, Mikel L. Forcada, Miquel Esplà-Gomis, Sergio Ortiz-Rojas, Gema Ramírez Sánchez, Víctor M. Sánchez-Cartagena, Barry Haddow, Marta Bañón, Marek Střelec, Anna Samiotou, Amir Kamran

2018

The January 2018 release of ParaCrawl is the first version of the corpus. It contains parallel corpora for 11 languages paired with English, crawled from a large number of websites. The selection of websites is based on CommonCrawl, but ParaCrawl is extracted from a new crawl that covers the selected websites much more thoroughly than CommonCrawl does. Because the data is fairly raw, it is released with two quality metrics that can be used for corpus filtering. An official "clean" version of each corpus is produced using one of these metrics. For more details and to download the raw data, visit: http://paracrawl.eu/releases.html
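
As a rough illustration of how per-pair quality scores might be used for corpus filtering, the sketch below keeps only sentence pairs whose scores reach a chosen threshold. The file layout is an assumption made for this example (four tab-separated fields: source sentence, target sentence, and two scores), as are the names `filter_pairs`, `min_a`, and `min_b`; the actual release format and metric names should be taken from the release page.

```python
def filter_pairs(lines, min_a, min_b=None):
    """Yield (source, target) pairs whose quality scores pass the thresholds.

    Assumed (hypothetical) line layout: source<TAB>target<TAB>score_a<TAB>score_b
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip lines that do not match the assumed layout
        source, target, raw_a, raw_b = fields
        try:
            score_a, score_b = float(raw_a), float(raw_b)
        except ValueError:
            continue  # skip lines whose scores are not numeric
        if score_a < min_a:
            continue
        if min_b is not None and score_b < min_b:
            continue
        yield source, target


# Made-up sample data for illustration only, not taken from the corpus.
SAMPLE = [
    "Hello world\tHallo Welt\t0.92\t0.88\n",
    "Click here\tImpressum\t0.15\t0.40\n",
    "malformed line without scores\n",
    "Good morning\tGuten Morgen\t0.81\t0.30\n",
]

kept_one_metric = list(filter_pairs(SAMPLE, min_a=0.5))
kept_both_metrics = list(filter_pairs(SAMPLE, min_a=0.5, min_b=0.5))
```

Filtering on a single score mirrors the idea of a "clean" version built from one metric; passing `min_b` as well shows how both scores could be combined if a stricter cut is wanted. The threshold values here are arbitrary.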

Keywords:

Machine translation
Natural language processing
Text corpus
Raw data
Download
Artificial intelligence
Computer science
Parallel corpora
