Inaccurate labels in weakly supervised deep learning: Automatic identification and correction and their impact on classification performance

2020 
In data-driven deep learning-based modeling, data quality can substantially influence classification performance, so correct data labeling is critical. In weakly supervised learning, one challenge is dealing with potentially inaccurate or mislabeled training data. In this paper, we propose an automated methodological framework to identify mislabeled data using two metric functions: cross-entropy loss, which indicates the divergence between a prediction and the ground truth, and the influence function, which reflects the dependence of a model on individual data. After correcting the identified mislabels, we measured their impact on classification performance. We also compared the effects of mislabeling in three experiments on two different real-world clinical questions. A total of 10,500 images were studied in the contexts of clinical breast density category classification and breast cancer malignancy diagnosis. We used intentionally flipped labels as mislabels to evaluate the proposed method at varying proportions of mislabeled data included in model training. We also compared our method with two published schemes for breast density category classification. Experimental results show that when 10% of the dataset is mislabeled, our method can automatically identify up to 98% of the mislabeled data by examining the top 30% of the full dataset. Furthermore, we show that correcting the identified mislabels improves classification performance. Our method provides a feasible solution for handling inaccurate labels in weakly supervised deep learning modeling.
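The loss-based half of the ranking idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the influence function is omitted, the function names (`per_sample_cross_entropy`, `flag_suspects`, `recall_of_flipped`) are hypothetical, and the predicted probabilities are toy values standing in for the output of a trained classifier. The sketch assumes that samples whose given label disagrees with the model's prediction receive a high cross-entropy loss, so reviewing the highest-loss fraction of the dataset should surface most flipped labels.

```python
import numpy as np

def per_sample_cross_entropy(probs, labels, eps=1e-12):
    """Cross-entropy of each sample's predicted probabilities against its given label."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Probability the model assigns to the (possibly wrong) given label.
    picked = probs[np.arange(len(labels)), labels]
    return -np.log(np.clip(picked, eps, 1.0))

def flag_suspects(losses, review_fraction=0.3):
    """Indices of the highest-loss samples, limited to review_fraction of the dataset."""
    n_review = max(1, int(round(review_fraction * len(losses))))
    order = np.argsort(-np.asarray(losses), kind="stable")  # descending loss
    return order[:n_review]

def recall_of_flipped(suspects, flipped_idx):
    """Fraction of intentionally flipped labels that fall inside the reviewed set."""
    flipped = {int(i) for i in flipped_idx}
    if not flipped:
        return 0.0
    found = flipped.intersection(int(i) for i in suspects)
    return len(found) / len(flipped)

# Toy data (assumption): 10 binary samples, a model that is 90% confident
# in the true class, and one intentionally flipped label (10% mislabeled).
true_labels = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
probs = np.where(true_labels[:, None] == np.arange(2)[None, :], 0.9, 0.1)

flipped_idx = [3]
noisy_labels = true_labels.copy()
noisy_labels[flipped_idx] = 1 - noisy_labels[flipped_idx]

losses = per_sample_cross_entropy(probs, noisy_labels)
suspects = flag_suspects(losses, review_fraction=0.3)
recall = recall_of_flipped(suspects, flipped_idx)
```

In this toy setting the flipped sample has the largest loss and therefore lands at the head of the review list. With a real model, the probabilities would come from held-out or cross-validated predictions, and the ranking could be combined with an influence-based score as the abstract describes.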