By all these lovely tokens... Merging Conflicting Tokenizations
2009
Given the contemporary trend to modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday's NLP practice and poses a non-trivial theoretical problem to the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences for the handling of queries on annotated corpora.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
38
References
20
Citations
NaN
KQI