Preliminary exploration of topic modelling representations for Electronic Health Records coding according to the International Classification of Diseases in Spanish

2022 
In this work, we address the classification of Electronic Health Records (EHRs) in Spanish according to the International Classification of Diseases (ICD). We employ topic models, which represent each document as a probability distribution over topics and thus offer a low-dimensional representation of documents. The current trend is to turn to embedding-based text representations, but these approaches require large amounts of textual data. We found topic models to be a suitable alternative given the few resources available for Spanish clinical text mining. Moreover, they are interpretable and support explainable artificial intelligence (XAI). We explored two methods: Latent Dirichlet Allocation (LDA) and Partially Labelled Latent Dirichlet Allocation (PLDA), a supervised variant of the former. We assessed the results attained in Spanish against an analogous task in English as a reference. Evaluation was applied directly to the representation, with metrics that measure topic coherence and the relationship between topics and ICD labels. We found that PLDA was able to discover topics associated with the ICD, which means that the representation itself can reveal ICD codes prior to classification. The representation was also used as predictive features for a conventional classifier to show its competence in a downstream task. We conclude that, where large datasets are unavailable, PLDA emerges as a versatile candidate able to offer a competitive representation of EHRs. While other works are primarily concerned with supervised categorization and do not pay attention to the representation, LDA and PLDA offer an interpretable approach that can be associated with ICD codes. Moreover, compared with works that employ LDA, we demonstrate that its supervised version, PLDA, can be more intuitive, as it shows a closer relation with the ICD codes.
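To make the idea of a document as a distribution over topics concrete, the following is a minimal sketch of plain (unsupervised) LDA fitted by collapsed Gibbs sampling. It is not the authors' implementation and does not cover PLDA; the function name `lda_gibbs`, the hyperparameter values, and the toy Spanish tokens are all assumptions chosen for illustration.

```python
import random

def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs is a list of token lists; returns one topic distribution per document.
    alpha, beta, iters and seed are hypothetical defaults, not values from the paper.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    # Count tables: document-topic, topic-word, and per-topic totals.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Assign every token to a random topic to initialise the counts.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][v] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1

    # Smoothed estimate of each document's topic distribution.
    theta = []
    for d, doc in enumerate(docs):
        denom = len(doc) + num_topics * alpha
        theta.append([(doc_topic[d][k] + alpha) / denom for k in range(num_topics)])
    return theta

# Toy tokenised "records" (invented for illustration only).
docs = [
    ["fiebre", "tos", "fiebre", "tos"],
    ["fractura", "dolor", "fractura"],
    ["tos", "fiebre", "tos"],
    ["dolor", "fractura", "dolor"],
]
theta = lda_gibbs(docs, num_topics=2)
for row in theta:
    print([round(p, 3) for p in row])
```

Each row of `theta` is a low-dimensional vector that sums to 1 and could serve as the feature vector passed to a conventional classifier. A supervised variant such as PLDA would additionally constrain which topics a document may use based on its labels, which this sketch does not attempt.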