By all these lovely tokens... Merging Conflicting Tokenizations

2009 
Given the contemporary trend to modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday's NLP practice and poses a non-trivial theoretical problem to the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences for the handling of queries on annotated corpora.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    38
    References
    20
    Citations
    NaN
    KQI
    []