A Lexical Approach to Identifying Subtype Inconsistencies in Biomedical Terminologies

2018 
We introduce a lexical-based inference approach for identifying subtype (or is-a relation) inconsistencies in biomedical terminologies. Given a terminology, we first represent the name of each concept in the terminology as a sequence of words. We then generate hierarchically linked and unlinked pairs of concepts, such that the two concepts in a pair have the same number of words, contain at least one word in common, and differ in a fixed number $n$ of words ($n = 1, 2, 3, 4, 5$). The linked and unlinked concept-pairs further infer corresponding linked and unlinked term-pairs, respectively. If a linked concept-pair and an unlinked concept-pair infer the same term-pair, we consider this a potential subtype inconsistency, which may indicate a missing subtype relation or an incorrect subtype relation. We applied this approach to the Gene Ontology (GO), the National Cancer Institute thesaurus (NCIt), and SNOMED CT. A total of 4,841 potential subtype inconsistencies were found in GO, 2,677 in NCIt, and 53,782 in SNOMED CT. Domain experts evaluated a random sample of 211 potential inconsistencies in GO and verified that 124 of them are valid (i.e., a precision of 58.77% for detecting subtype inconsistencies in GO). We also performed a preliminary study on the extent to which external knowledge in the Unified Medical Language System (UMLS) can provide supporting evidence for validating the detected potential inconsistencies: 0.54% ($= 26/4841$) for GO, 11.43% ($= 306/2677$) for NCIt, and 3.61% ($= 1940/53782$) for SNOMED CT. Results indicate that our lexical-based inference approach is a promising way to identify subtype inconsistencies and facilitates the quality improvement of biomedical terminologies.
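The pairing-and-inference idea can be illustrated with a minimal Python sketch. It is not the authors' implementation and rests on several assumptions that the abstract does not spell out: words are compared position by position, a term-pair is taken to be the differing words of the two concept names, "linked" means a direct is-a edge, and the concept names and the single is-a edge below are made-up toy data. The function names `term_pair` and `find_inconsistencies` are hypothetical.

```python
from itertools import permutations


def term_pair(child, parent, max_diff=5):
    """Return the (child_words, parent_words) that differ, or None.

    Assumption: names are compared position by position, and the pair
    qualifies only if both names have the same number of words, share at
    least one word, and differ in 1..max_diff positions.
    """
    a, b = child.split(), parent.split()
    if len(a) != len(b):
        return None
    diff = [(x, y) for x, y in zip(a, b) if x != y]
    n = len(diff)
    if n == 0 or n > max_diff or n == len(a):
        return None  # identical, too different, or no shared word
    return (" ".join(x for x, _ in diff), " ".join(y for _, y in diff))


def find_inconsistencies(names, is_a, max_diff=5):
    """Map each term-pair inferred by both a linked and an unlinked
    concept-pair to (linked_pairs, unlinked_pairs)."""
    linked, unlinked = {}, {}
    for c, p in permutations(names, 2):
        tp = term_pair(c, p, max_diff)
        if tp is None:
            continue
        if (c, p) in is_a:
            linked.setdefault(tp, []).append((c, p))
        elif (p, c) not in is_a:
            # Unlinked only if there is no is-a edge in either direction.
            unlinked.setdefault(tp, []).append((c, p))
    return {tp: (linked[tp], unlinked[tp]) for tp in linked if tp in unlinked}


# Toy data (invented for illustration, not taken from GO, NCIt, or SNOMED CT).
names = [
    "cardiac muscle development",
    "striated muscle development",
    "cardiac muscle contraction",
    "striated muscle contraction",
]
# Hypothetical hierarchy: one direct is-a edge, stored as (child, parent).
is_a = {("cardiac muscle development", "striated muscle development")}

result = find_inconsistencies(names, is_a)
for tp, (linked_pairs, unlinked_pairs) in result.items():
    print(tp, linked_pairs, unlinked_pairs)
```

In this toy case the linked pair and the unlinked pair both infer the term-pair ("cardiac", "striated"), so the unlinked pair is flagged as a candidate for a missing subtype relation. Whether a flagged case is a missing or an incorrect relation still needs expert review, as in the evaluation described above.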