UDPipe at SIGMORPHON 2019: Contextualized Embeddings, Regularization with Morphological Categories, Corpora Merging
Abstract:
We present our contribution to the SIGMORPHON 2019 Shared Task: Crosslinguality and Context in Morphology, Task 2: contextual morphological analysis and lemmatization. We submitted a modification of UDPipe 2.0, one of the best-performing systems of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies and the overall winner of the 2018 Shared Task on Extrinsic Parser Evaluation. As our first improvement, we use pretrained contextualized embeddings (BERT) as additional inputs to the network; secondly, we use individual morphological features as regularization; and finally, we merge selected corpora of the same language. In the lemmatization task, our system exceeds all the submitted systems by a wide margin, with a lemmatization accuracy of 95.78 (second best was 95.00, third 94.46). In the morphological analysis, our system placed a close second: our morphological analysis accuracy was 93.19, the winning system's 93.23.
Keywords:
Lemmatisation
Regularization
Merge (version control)
Margin (machine learning)
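The abstract mentions using individual morphological features as regularization. One plausible reading, offered here only as a sketch and not as the authors' implementation, is that besides predicting the full composite tag, the network is also trained to predict each feature on its own. The snippet below shows how such auxiliary targets could be derived from tags. The helper names (`split_features`, `build_feature_index`, `multi_hot`), the example tags, and the `;`-separated tag format are assumptions made for illustration.

```python
from typing import Dict, List


def split_features(tag: str, sep: str = ";") -> List[str]:
    """Split a composite morphological tag into its individual features."""
    return [f for f in tag.split(sep) if f]


def build_feature_index(tags: List[str]) -> Dict[str, int]:
    """Assign a stable integer index to every feature seen in the tags."""
    features = sorted({f for t in tags for f in split_features(t)})
    return {f: i for i, f in enumerate(features)}


def multi_hot(tag: str, index: Dict[str, int]) -> List[int]:
    """Encode a tag as a multi-hot vector over the individual features."""
    vec = [0] * len(index)
    for f in split_features(tag):
        if f in index:  # features unseen when building the index are ignored
            vec[index[f]] = 1
    return vec


# Hypothetical example tags, not taken from the shared-task data.
tags = ["N;NOM;SG", "N;ACC;PL", "V;PST;3;SG"]
index = build_feature_index(tags)
aux_targets = [multi_hot(t, index) for t in tags]
```

In a setup like this, the multi-hot vectors would serve as extra training targets next to the full-tag label; how the two losses are combined is left open here.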
This dataset contains randomly selected and manually lemmatized sentences from the Corpus of Old Literary Finnish. Please cite and consult the original corpus as well: Institute for the Languages of Finland (2013). Corpus of Old Literary Finnish [text corpus]. The Language Bank of Finland. Retrieved from http://urn.fi/urn:nbn:fi:lb-201407165. Currently, four individual decades have not been lemmatized: 1690, 1720, 1740 and 1770. Additionally, many decades are not present in the dataset at all. Adding these and completing the material in various ways is an important goal for further research. We also welcome corrections and additions to the dataset by other researchers.
Lemmatisation
Citations (0)
The grammatical description of Old English lacks complete and systematic lemmatization, which hinders Natural Language Processing studies in this language, as they strongly rely on the existence of large, annotated corpora. Moreover, the inflectional features of Old English preclude token-based automatic lemmatization. Therefore, specifically goal-oriented applications must be developed to account for the automatic lemmatization of specific variable categories. This article designs an automatic lemmatizer within the framework of Morphological Generation to address the type-based lemmatization of Old English class V strong verbs (L-Y). The lemmatizer is implemented with rules that account for inflectional, derivational and morphophonological variation. The generated forms are compared with the most relevant corpora of Old English for validation before being assigned a lemma. The lemmatizer is successful in supplying form-lemma associations not yet accounted for in the literature, and in identifying mismatches and areas for manual revision.
Lemmatisation
Lemma (botany)
Variation (astronomy)
Citations (0)
This paper presents preliminary results on data-driven lemmatization for Italian. Besides an intrinsic evaluation of this task, we want to measure its usefulness and adequacy by using our system's output as input to parsing, following a methodology developed for French. This approach achieves state-of-the-art parsing accuracy without requiring any prior knowledge of the language.
Lemmatisation
Citations (0)
The Reading Machine is a parsing framework that takes raw text as input and performs six standard NLP tasks: tokenization, POS tagging, morphological analysis, lemmatization, dependency parsing and sentence segmentation. It is built upon transition-based parsing and allows a large number of parsing configurations to be implemented, among them a fully incremental one. Three case studies are presented to highlight the versatility of the framework. The first explores whether an incremental parser is able to take into account top-down dependencies (i.e. the influence of high-level decisions on low-level ones), the second compares the performance of an incremental and a pipeline architecture, and the third quantifies the impact of the right context on the predictions made by an incremental parser.
Lemmatisation
Lexical analysis
Dependency grammar
Citations (3)
Development of human language technologies for the indigenous South African languages is currently being undertaken in various projects across South Africa. In one such project a lemmatizer for Setswana is being developed, and this article reports on work towards the development of a first prototype. A prerequisite of lemmatization is to determine what the output of a lemmatizer for a specific language should be (i.e. what should be considered a lemma in that language). Consequently, the concept of a lemma as it should be understood in the context of Setswana lemmatization is defined, and it is indicated that only nouns and verbs really pose challenges for the lemmatization of Setswana. The computational approach taken in this research, and the implementation applied, which use FSA 6, are described at length. Preliminary results indicate that the rules for nouns and verbs are rather accurate, with precision scores of 93–94% obtained in a small, contained experiment. The article concludes with a discussion of future work.
Lemmatisation
Lemma (botany)
Citations (9)