Learning from an Imbalanced and Limited Dataset and an Application to Medical Imaging

2019 
Chest X-rays (CXRs) are routinely acquired in medical imaging to diagnose lung diseases. For many patients, however, accurate and timely radiologic interpretation of the acquired CXRs is not always feasible because medical personnel and resources are limited. A computer-aided diagnosis (CAD) system based on machine learning would be an effective way to make disease diagnosis more efficient. However, obtaining a sufficiently large, balanced, and annotated dataset of CXRs for training such a system is challenging in practice. In this paper, we present a comprehensive comparative study on learning from imbalanced and limited CXRs to detect pneumonia, addressing two main questions: (1) Is data sampling an effective method for improving the performance of learning models? (2) Are there quantifiable differences between learning models trained with different sampling techniques? With respect to data sampling, we investigate two general categories of techniques that modify an imbalanced dataset to yield a balanced data distribution: (i) undersampling the majority class; and (ii) oversampling/augmentation of the minority class. With respect to learning models, we focus on the support vector machine (SVM) and the deep convolutional neural network (CNN). Using a publicly available CXR dataset, we demonstrate that both SVM and CNN models exhibit improved performance when the data sampling strategy is properly selected.
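
The two sampling categories named above can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names `random_undersample` and `random_oversample`, the toy labels, and the use of plain random duplication (rather than image augmentation) are assumptions made for illustration only.

```python
import random
from collections import Counter


def _group_by_class(samples, labels):
    """Collect samples into a dict keyed by class label."""
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    return by_class


def random_undersample(samples, labels, seed=0):
    """Hypothetical helper: randomly drop samples so that every class
    is reduced to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = min(len(xs) for xs in by_class.values())
    pairs = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, target)]
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


def random_oversample(samples, labels, seed=0):
    """Hypothetical helper: randomly duplicate samples so that every class
    grows to the size of the largest class (duplication, not augmentation)."""
    rng = random.Random(seed)
    by_class = _group_by_class(samples, labels)
    target = max(len(xs) for xs in by_class.values())
    pairs = []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]


# Toy stand-in for image identifiers: 8 majority-class and 2 minority-class items.
toy_samples = list(range(10))
toy_labels = ["majority"] * 8 + ["minority"] * 2

_, under_labels = random_undersample(toy_samples, toy_labels)
_, over_labels = random_oversample(toy_samples, toy_labels)
print(Counter(under_labels))
print(Counter(over_labels))
```

In a sketch like this, resampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution. Undersampling discards data, which matters when the dataset is already limited, while duplication-based oversampling adds no new information; augmentation of the minority class is one way to address the latter.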