The Hungarian Gigaword Corpus

Csaba Oravecz, Tamás Váradi, Bálint Sass

2014

The paper reports on the development of the Hungarian Gigaword Corpus, an extended new edition of the Hungarian National Corpus with upgraded and redesigned linguistic annotation and an increased size of 1.5 billion tokens. The standard steps of corpus collection and preparation are discussed, with special emphasis on linguistic analysis and annotation, because Hungarian has characteristics that make computational processing challenging.

Keywords:

Artificial intelligence
Natural language processing
Computer science
Annotation
Linguistic analysis
Linguistics
