SYN2020: A New Corpus of Czech with an Innovated Annotation

2021 
The paper introduces the SYN2020 corpus, a newly released representative corpus of written Czech that continues the tradition of the Czech National Corpus SYN series. The design of SYN2020 incorporates several substantial new features in the areas of segmentation, lemmatization and morphological tagging, such as a new treatment of lemma variants, a new system for identifying the morphological categories of verbs, and a new treatment of multiword tokens. The paper describes the annotation process, including the data and tools used, and discusses the accuracy of the resulting annotation.