Deep learning demands a large amount of annotated data, and the annotation task is often crowdsourced for economic efficiency. When the annotation task is delegated to non-experts, the dataset may contain data with inaccurate labels. Noisy labels not only yield classification models with sub-optimal performance, but may also impede their optimization dynamics. In this work, we propose exploiting the pattern recognition capacity of deep convolutional neural networks to filter out supposedly mislabeled cases during training. We suggest a training method that references softmax outputs to judge the correctness of the given labels. This approach achieved outstanding performance compared with existing methods in various noise settings on a large-scale dataset (Kaggle 2015 Diabetic Retinopathy). Furthermore, we demonstrate a method for mining positive cases from a pool of unlabeled images by exploiting the network's generalization ability. With this method, we won first place on the offsite validation dataset in the pathological myopia classification challenge (PALM), achieving an AUROC of 0.9993 in the final submission. Source code is publicly available.
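The filtering idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract only states that softmax outputs are referenced to judge label correctness, so the function names `softmax` and `keep_mask` and the fixed threshold of 0.5 are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keep_mask(batch_logits, labels, threshold=0.5):
    """Flag which samples to keep in the loss for this batch.

    A sample is kept when the softmax probability assigned to its
    given label is at least `threshold`; otherwise the label is
    treated as suspect. The threshold is a hypothetical hyperparameter,
    not a value taken from the paper.
    """
    mask = []
    for logits, y in zip(batch_logits, labels):
        p_given_label = softmax(logits)[y]
        mask.append(p_given_label >= threshold)
    return mask

# Three samples, two classes; the second and third labels disagree
# with what the network currently predicts.
batch_logits = [[2.0, 0.1], [0.2, 3.0], [1.5, 1.4]]
labels = [0, 0, 1]
mask = keep_mask(batch_logits, labels)
print(mask)
```

In a training loop, the loss would then be averaged only over samples where the mask is true. How the threshold is chosen or scheduled is left open here.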
Retinal vessel segmentation is an indispensable step for automatic detection of retinal diseases with fundoscopic images. Though many approaches have been proposed, existing methods tend to miss fine vessels or allow false positives at terminal branches. Beyond under-segmentation, over-segmentation is also problematic when quantitative studies need to measure the precise width of vessels. In this paper, we present a method that generates a precise map of retinal vessels using generative adversarial training. Our method achieves a Dice coefficient of 0.829 on the DRIVE dataset and 0.834 on the STARE dataset, which is state-of-the-art performance on both datasets.
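For readers unfamiliar with the metric reported above, the Dice coefficient between a predicted and a reference binary mask is 2|P ∩ T| / (|P| + |T|). The sketch below computes it for flattened 0/1 masks. The function name `dice_coefficient` and the handling of two empty masks are assumptions, and the toy masks are not data from the paper.

```python
def dice_coefficient(pred, target):
    """Dice = 2*|P ∩ T| / (|P| + |T|) for flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    denom = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if denom == 0 else 2.0 * intersection / denom

# Toy example: 3 predicted vessel pixels, 3 reference pixels, 2 shared.
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred, target)  # 2*2 / (3+3)
print(round(score, 4))
```

Because both false positives and false negatives enlarge the denominator without adding to the intersection, the score penalizes over-segmentation and under-segmentation alike.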
International challenges have become the de facto standard for comparative assessment of image analysis algorithms given a specific task. Segmentation is so far the most widely investigated medical image processing task, but the various segmentation challenges have typically been organized in isolation, such that algorithm development was driven by the need to tackle a single specific clinical problem. We hypothesized that a method capable of performing well on multiple tasks will generalize well to a previously unseen task and potentially outperform a custom-designed solution. To investigate the hypothesis, we organized the Medical Segmentation Decathlon (MSD), a biomedical image analysis challenge in which algorithms compete in a multitude of both tasks and modalities. The underlying data set was designed to explore the axis of difficulties typically encountered when dealing with medical images, such as small data sets, unbalanced labels, multi-site data and small objects. The MSD challenge confirmed that algorithms with consistently good performance on a set of tasks preserved their good average performance on a different set of previously unseen tasks. Moreover, by monitoring the MSD winner for two years, we found that this algorithm continued generalizing well to a wide range of other clinical problems, further confirming our hypothesis. Three main conclusions can be drawn from this study: (1) state-of-the-art image segmentation algorithms are mature, accurate, and generalize well when retrained on unseen tasks; (2) consistent algorithmic performance across multiple tasks is a strong surrogate of algorithmic generalizability; (3) the training of accurate AI segmentation models is now commoditized to non-AI experts.
Purpose: To evaluate the clinical usefulness of a deep learning-based detection device for multiple abnormal findings on retinal fundus photographs for readers with varying expertise.
Purpose: To evaluate high accumulation of coronary artery calcium (CAC) from retinal fundus images with deep learning technologies as an inexpensive and radiation-free screening method. Methods: Individuals who underwent bilateral retinal fundus imaging and CAC score (CACS) evaluation from coronary computed tomography scans on the same day were identified. With this database, the performance of deep learning algorithms (inception-v3) in distinguishing high CACS from CACS of 0 was evaluated at various thresholds for high CACS. Vessel-inpainted and fovea-inpainted images were also used as input to investigate areas of interest in determining CACS. Results: A total of 44,184 images from 20,130 individuals were included. A deep learning algorithm for discrimination of no CAC from CACS > 100 achieved an area under the receiver operating characteristic curve (AUROC) of 82.3% (79.5%–85.0%) and 83.2% (80.2%–86.3%) using unilateral and bilateral fundus images, respectively, under a 5-fold cross-validation setting. AUROC increased as the threshold for high CACS was raised, plateauing at 100 with no significant improvement thereafter. AUROC decreased when the fovea was inpainted and decreased further when vessels were inpainted, whereas AUROC increased when bilateral images were used as input. Conclusions: Visual patterns of retinal fundus images in subjects with CACS > 100 could be distinguished by deep learning algorithms from those in subjects with no CAC. Exploiting bilateral images improves discrimination performance, and ablation studies removing retinal vasculature or fovea suggest that recognizable patterns reside mainly in these areas. Translational Relevance: Retinal fundus images can be used by deep learning algorithms for prediction of high CACS.
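The AUROC figures above can be read as the probability that a randomly chosen high-CACS case receives a higher model score than a randomly chosen zero-CACS case. The sketch below computes AUROC directly from that pairwise definition. It is an illustration under stated assumptions, not the study's evaluation code: the function name `auroc` and the toy scores are hypothetical, and confidence intervals and cross-validation are omitted.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win,
    which matches the Mann-Whitney U formulation of AUROC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy scores: label 1 stands for high CACS, label 0 for CACS of 0.
scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
value = auroc(scores, labels)  # 8 of 9 pairs are ranked correctly
print(round(value, 4))
```

The pairwise loop is quadratic in the number of cases, so for a dataset of the size reported above a rank-based library routine would be the practical choice.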
We describe a challenge named "Diabetic Retinopathy (DR)-Grading and Image Quality Estimation Challenge", held in conjunction with ISBI 2020, which comprised three sub-challenges aimed at developing deep learning models for DR image assessment and grading. The scientific community responded positively to the challenge, with 34 submissions from 574 registrations. In the challenge, we provided the DeepDRiD dataset containing 2,000 regular DR images (500 patients) and 256 ultra-widefield images (128 patients), both having DR quality and grading annotations. We discuss details of the top 3 algorithms in each sub-challenge. The weighted kappa for DR grading ranged from 0.93 to 0.82, and the accuracy for image quality evaluation ranged from 0.70 to 0.65. The results showed that image quality assessment remains a target for further exploration. We have also released the DeepDRiD dataset on GitHub to help develop automatic systems and improve human judgment in DR screening and diagnosis.
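The weighted kappa used for DR grading above measures agreement between predicted and reference grades while penalizing large disagreements more than small ones. The sketch below assumes quadratic weights and five grades (0 to 4). The abstract does not state the weighting scheme, so both the weighting and the function name `weighted_kappa` are assumptions made for illustration.

```python
def weighted_kappa(rater_a, rater_b, n_classes):
    """Quadratic weighted kappa for integer grades in [0, n_classes - 1]."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    if n_classes < 2:
        raise ValueError("need at least two classes")
    n = len(rater_a)
    # Observed confusion matrix between the two sets of grades.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n_classes))
              for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            den += w * hist_a[i] * hist_b[j] / n
    # If chance disagreement is zero, both raters used one identical grade.
    return 1.0 if den == 0 else 1.0 - num / den

# Toy grades: one prediction is off by a single grade.
reference = [0, 1, 2, 3, 4, 2]
predicted = [0, 1, 2, 3, 4, 3]
kappa = weighted_kappa(reference, predicted, n_classes=5)
print(round(kappa, 4))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance, which gives context for the 0.82 to 0.93 range reported for the top entries.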