The quest for better clinical word vectors: Ontology based and lexical vector augmentation versus clinical contextual embeddings

2021 
Abstract

Background: Word vectors, or word embeddings, are n-dimensional representations of words and form the backbone of Natural Language Processing (NLP) of textual data. This research experiments with algorithms that augment word vectors with two kinds of constraints: lexical constraints that are popular in NLP research, and clinical domain constraints derived from the Unified Medical Language System (UMLS). It also compares the performance of the augmented vectors with Bio + Clinical BERT vectors, which have been trained and fine-tuned on clinical datasets.

Methods: Word2vec vectors are generated for words in a publicly available de-identified Electronic Health Records (EHR) dataset and augmented with ontologies using three algorithms that take fundamentally different approaches to vector augmentation. The augmented vectors are then evaluated alongside publicly available Bio + Clinical BERT vectors on their correlation with human-annotated lists, measured with Spearman's correlation coefficient. They are also evaluated on the downstream task of Named Entity Recognition (NER). Quantitative and empirical evaluations are used to highlight the strengths and weaknesses of the different approaches.

Results: In our experiments, the counter-fitted word2vec vectors augmented with information from the UMLS ontology produce the best overall correlation with human-annotated evaluation lists (Spearman's correlation of 0.733 with the mini mayo-doctors' annotation), while Bio + Clinical BERT produces the best results in the NER task (F1 of 0.87 and 0.811 on the i2b2 2010 and i2b2 2012 datasets, respectively).

Conclusion: Clinically adapted word2vec vectors successfully encapsulate concepts of lexical and clinical synonymy and antonymy and, to a smaller extent, hyponymy and hypernymy. Bio + Clinical BERT vectors perform better at NER and avoid out-of-vocabulary words.
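The intrinsic evaluation described in the Methods can be illustrated with a minimal sketch: score each annotated word pair by the cosine similarity of its vectors, then compute Spearman's correlation between the model scores and the human scores. This is not the authors' code. The function names, the choice to skip out-of-vocabulary pairs, and the toy vectors and ratings below are all assumptions made for illustration; they are not taken from the paper's datasets.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate(vectors, annotated_pairs):
    """Correlate model similarity with human ratings.

    Pairs with an out-of-vocabulary word are skipped (an assumption;
    other handling, such as a default score, is equally possible).
    Returns (rho, number_of_pairs_used).
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in annotated_pairs:
        if w1 in vectors and w2 in vectors:
            model_scores.append(cosine(vectors[w1], vectors[w2]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores), len(human_scores)


# Toy, made-up vectors and ratings, for illustration only.
toy_vectors = {
    "hypertension": [1.0, 0.0],
    "high_blood_pressure": [1.0, 0.1],
    "fracture": [0.0, 1.0],
}
toy_pairs = [
    ("hypertension", "high_blood_pressure", 4.0),
    ("hypertension", "fracture", 1.0),
    ("high_blood_pressure", "fracture", 1.5),
    ("hypertension", "unseen_term", 3.0),  # skipped: out of vocabulary
]
rho, used = evaluate(toy_vectors, toy_pairs)
print(f"Spearman rho = {rho:.3f} over {used} pairs")
```

Because Spearman's coefficient depends only on rank order, it rewards vectors that order word pairs the way annotators do, regardless of the absolute scale of the cosine scores. Skipping out-of-vocabulary pairs also shows why vocabulary coverage matters when comparing static vectors with subword-based models.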