Quality issues in thesaurus building: a case study from the medical domain

2012 
Ensuring the quality of a medical thesaurus is a non-trivial task because of the inherent complexity of medical terminology. The peculiarities of the medical sublanguage and the subjectivity of lexicographers' choices complicate the thesaurus construction process. Our experience is based on the MorphoSaurus lexicon, the basis of a biomedical cross-language indexing and retrieval system. We describe two complementary maintenance approaches: (i) corpus-based error detection and (ii) thesaurus anomaly detection. These techniques were developed to detect so-called dynamic and static errors, which lexicographers commit during the construction and maintenance process. In multilingual parallel corpora, the distribution of semantic identifiers should be similar across related texts in different languages. The first approach therefore identifies the semantic identifiers that exhibit the greatest frequency variations between the texts of a pair. A manual review of these results is intended to spot content errors, which the lexicographers then classify and fix. The second approach analyses transaction-based anomalies, which are identified by interpreting the log of lexicographers' actions during thesaurus maintenance. This methodology highlights the four most common types of this kind of anomaly and evaluates the effectiveness of the corpus-based detection techniques. The overall quality improvement of the thesaurus was evaluated using the OHSUMED IR benchmark.
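
The corpus-based idea in the abstract (comparing how often each semantic identifier occurs in the two halves of a parallel text pair) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of relative frequencies, the absolute-difference score, and the identifier strings are all assumptions made for illustration.

```python
from collections import Counter


def relative_frequencies(ids):
    """Map each semantic identifier to its share of all identifiers in a text."""
    counts = Counter(ids)
    total = sum(counts.values())
    return {sid: n / total for sid, n in counts.items()} if total else {}


def rank_frequency_gaps(doc_pairs, top_n=10):
    """Rank identifiers by summed frequency gap over aligned text pairs.

    doc_pairs: iterable of (ids_lang_a, ids_lang_b), where each element is
    the list of semantic identifiers produced by indexing one text of a
    parallel pair. The scoring rule (sum of absolute differences of
    relative frequencies) is an assumption, not taken from the paper.
    """
    gap = Counter()
    for ids_a, ids_b in doc_pairs:
        freq_a = relative_frequencies(ids_a)
        freq_b = relative_frequencies(ids_b)
        # Consider every identifier that occurs in either text of the pair.
        for sid in freq_a.keys() | freq_b.keys():
            gap[sid] += abs(freq_a.get(sid, 0.0) - freq_b.get(sid, 0.0))
    return gap.most_common(top_n)


# Hypothetical identifiers for one aligned text pair.
pairs = [
    (["#heart", "#heart", "#inflamm", "#kidney"],
     ["#heart", "#inflamm", "#inflamm", "#inflamm",
      "#kidney", "#kidney", "#kidney", "#kidney"]),
]
for sid, score in rank_frequency_gaps(pairs, top_n=3):
    print(sid, round(score, 3))
```

Identifiers at the top of such a ranking would be candidates for the manual review step; a large gap only suggests a possible lexicon error and does not confirm one.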