Controlling for Confounding Variables: Accounting for Dataset Bias in Classifying Patient-Provider Interactions

2021 
Natural Language Processing (NLP) is a key enabling technology for the re-use of information in free-text clinical notes. However, a barrier to deployment is the limited availability of labeled corpora for supervised machine learning, which are expensive to acquire because they must be annotated by experienced clinicians. Where corpora are available, they may have been collected opportunistically and are thus vulnerable to bias. Here we evaluate an approach to accounting for dataset bias in the context of identifying specific patient-provider interactions. In this context, bias arises when a phenomenon is over- or under-represented in a particular type of clinical note as a result of the way the dataset was curated. Using a clinical dataset that varies widely in author and setting, we control for confounding variables using a backdoor adjustment approach [1, 2], which to our knowledge has not previously been applied in the clinical domain. This approach improves precision by up to 5%, and the adjusted models’ scores for false positives are generally lower, resulting in a more generalizable model with the potential to enhance the downstream utility of models trained on opportunistically collected clinical corpora.
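As a rough illustration of the general idea behind backdoor adjustment, the sketch below marginalizes a classifier's prediction over a confounder z (here, a note type) using p(y | do(x)) = sum over z of p(y | x, z) p(z). It is a minimal sketch under stated assumptions, not the implementation evaluated here: the note types, the weights, and the names `toy_model`, `confounder_prior`, and `backdoor_adjusted_prob` are all hypothetical.

```python
import math
from collections import Counter


def sigmoid(t):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def confounder_prior(z_train):
    """Estimate p(z) from the confounder values observed in training data."""
    counts = Counter(z_train)
    total = len(z_train)
    return {z: c / total for z, c in counts.items()}


def backdoor_adjusted_prob(p_y_given_x_z, x, prior):
    """Backdoor adjustment: sum over z of p(y=1 | x, z) * p(z)."""
    return sum(p_y_given_x_z(x, z) * p_z for z, p_z in prior.items())


def toy_model(x, z):
    """Hypothetical classifier p(y=1 | x, z) with hand-set weights.

    x is a single numeric text feature; z is a hypothetical note type.
    """
    text_weight = 2.0
    note_type_weight = {"clinic": 1.5, "inpatient": -1.5}
    return sigmoid(text_weight * x + note_type_weight[z])


# Hypothetical training-set note types: 3 clinic notes, 1 inpatient note.
prior = confounder_prior(["clinic", "clinic", "clinic", "inpatient"])

# Prediction for a note whose text feature is uninformative (x = 0.0):
# the adjusted score no longer depends on which note type it came from.
adjusted = backdoor_adjusted_prob(toy_model, 0.0, prior)
```

Because the adjusted score averages over note types rather than conditioning on the observed one, a note is not scored higher merely because it comes from a note type in which the phenomenon was over-represented during curation.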