Deep Learning for the Diagnosis of Stage in Retinopathy of Prematurity: Accuracy and Generalizability across Populations and Cameras.

Jimmy Chen, Aaron S. Coyner, Susan Ostmo, Kemal Sonmez, Sanyam Bajimaya, Eli Pradhan, Nita Valikodath, Emily D. Cole, Tala Al-Khaled, R.V. Paul Chan, Praveer Singh, Jayashree Kalpathy-Cramer, Michael F. Chiang, J. Peter Campbell

2021

Abstract
Purpose The presence of stage is an important feature to identify in retinal images of infants at risk for retinopathy of prematurity (ROP). The purpose of this study was to implement a convolutional neural network (CNN) for binary detection of stage 1-3 in ROP and to evaluate its generalizability across different populations and camera systems.
Design Diagnostic validation study of a CNN for stage detection.
Subjects, Participants, and/or Controls Retinal fundus images obtained from preterm infants during routine ROP screenings.
Methods Two datasets were used: 6247 fundus images taken with a RetCam camera at nine North American institutions, and 4647 images taken with a Forus 3nethra camera at four hospitals in Nepal. Images were labeled for the presence of stage by one to three expert graders. Three CNN models were trained using 5-fold cross-validation on the North American dataset alone, the Nepali dataset alone, and a combined dataset, and were evaluated on two held-out test sets consisting of 708 and 247 images from the Nepali and North American datasets, respectively.
Main Outcome Measures CNN performance was evaluated using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), sensitivity, and specificity.
Results Both the North American- and Nepali-trained models demonstrated high performance on a test set from the same population: AUROC/AUPRC of 0.99/0.98 with sensitivity of 94%, and 0.97/0.91 with sensitivity of 73%, respectively. However, the performance of each model decreased to 0.96/0.88 (sensitivity 52%) and 0.62/0.36 (sensitivity 44%), respectively, when evaluated on a test set from the other population. Compared with the models trained on individual datasets, the model trained on the combined dataset achieved improved performance on each respective test set: sensitivity improved from 94% to 98% on the North American test set, and from 73% to 82% on the Nepali test set.
Conclusions A CNN can accurately identify the presence of ROP stage in retinal images, but performance depends on the similarity between training and testing populations. We demonstrate that internal and external performance can be improved by increasing the heterogeneity of the training dataset, in this case by combining images from different populations and cameras.
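
To make the outcome measures above concrete, the following is a minimal sketch of how sensitivity, specificity, and AUROC can be computed for a binary classifier. It is not the authors' code: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions chosen for illustration (the abstract does not state the operating point used). AUROC is computed here through its rank interpretation, the probability that a randomly chosen positive image scores higher than a randomly chosen negative one.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Sensitivity and specificity at a fixed threshold (0.5 is an assumed default)."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1 and predicted_positive:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is O(n^2) and is
    meant for illustration, not for large test sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return correct / (len(pos) * len(neg))


# Hypothetical toy data: 1 = stage present, 0 = stage absent.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]

sens, spec = sensitivity_specificity(toy_labels, toy_scores)
auc = auroc(toy_labels, toy_scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUROC={auc:.4f}")
```

Because AUROC depends only on how scores are ranked while sensitivity depends on the chosen threshold, a model can in principle keep a high AUROC on a new population and still lose sensitivity at a fixed operating point if its scores shift downward.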