Benchmarking Automatic Segmentation Algorithms Against Human Interobserver Variability of Prostate and Organs at Risk Delineation on Prostate MRI.

2021 
PURPOSE/OBJECTIVE(S)
The objective of this study was to benchmark custom automatic segmentation software against human interobserver variability of prostate and organ-at-risk (OAR) segmentation on 3D MRIs for prostate radiotherapy. We hypothesized that automatic segmentation algorithms can exhibit consistency comparable to that of human readers, including radiation oncologists.

MATERIALS/METHODS
Twenty-five patients underwent MRI-based treatment planning and quality assessment for low-dose-rate prostate brachytherapy (LDRPBT); each patient received both a treatment planning MRI and a postimplant quality assessment MRI. All MRIs were acquired on 3T scanners with flexible external array coils and no endorectal coil. The preimplant MRIs were acquired with a 3D T2-weighted TSE sequence, and the postimplant MRIs with a 3D T2/T1-weighted fully balanced SSFP sequence. Five human observers involved in the LDRPBT workflow, including 2 radiation oncologists, delineated the prostate and 4 OARs on the 50 MRIs. The OARs were the external urinary sphincter (EUS), seminal vesicles (SV), rectum, and bladder. Human interobserver variability (IoV) among the 5 readers was quantified with similarity metrics, including precision (P), recall (R), Jaccard index (JI), and Matthews correlation coefficient (MCC), computed between unique observer pairs for all 5 organ segmentation masks (these metrics are sketched in code below). A custom deep learning-based autosegmentation algorithm was used to segment the prostate and the 4 OARs simultaneously on the same 50 MRIs annotated by the human readers, and P, R, JI, and MCC were evaluated at different probability thresholds. These comparisons were made against reference segmentation masks of the 5 organs computed from the 5 human segmentation masks using the simultaneous truth and performance level estimation (STAPLE) algorithm. The human IoV similarity metrics were compared with the computer similarity metrics at the point of maximum MCC using two-tailed nonparametric Wilcoxon rank sum tests.

RESULTS
For the preimplant MRIs, the computer similarity metrics (C) were lower than the human IoV similarity metrics (H) for all organs except the EUS (MCC, C vs. H: prostate, 0.868 vs. 0.903, p < 0.0001; EUS, 0.610 vs. 0.552, p = 0.1690; SV, 0.678 vs. 0.750, p = 0.0094; rectum, 0.790 vs. 0.886, p < 0.0001; bladder, 0.879 vs. 0.941, p < 0.0001). The opposite was observed for the postimplant MRIs, where C was significantly higher than H for all 5 organs (MCC, C vs. H: prostate, 0.925 vs. 0.892, p < 0.0001; EUS, 0.743 vs. 0.620, p = 0.0003; SV, 0.849 vs. 0.760, p < 0.0001; rectum, 0.925 vs. 0.876, p < 0.0001; bladder, 0.972 vs. 0.938, p < 0.0001).

CONCLUSION
The autosegmentation algorithm produced high-quality segmentation masks for both the preimplant and postimplant MRIs. The human readers were more consistent than the autosegmentation algorithm on the T2-weighted preimplant MRIs, but the autosegmentation algorithm was more consistent than the human readers on the postimplant MRIs.
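The study's analysis code is not described beyond the metric names, so the following is only a minimal Python sketch of how the pairwise similarity metrics (P, R, JI, MCC) could be computed between binary segmentation masks; the function names pairwise_similarity and interobserver_metrics, and the assumption that the contours are available as same-shaped NumPy arrays, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from itertools import combinations


def pairwise_similarity(mask_a: np.ndarray, mask_b: np.ndarray) -> dict:
    """Precision, recall, Jaccard index, and Matthews correlation coefficient
    between two binary segmentation masks of identical shape.
    mask_a plays the role of the prediction, mask_b the reference."""
    a = mask_a.astype(bool).ravel()
    b = mask_b.astype(bool).ravel()
    tp = np.sum(a & b)    # foreground in both masks
    fp = np.sum(a & ~b)   # foreground only in mask_a
    fn = np.sum(~a & b)   # foreground only in mask_b
    tn = np.sum(~a & ~b)  # background in both masks
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    jaccard = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    # Cast to float before multiplying to avoid integer overflow on large volumes.
    denom = np.sqrt(float(tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"P": precision, "R": recall, "JI": jaccard, "MCC": mcc}


def interobserver_metrics(observer_masks: list) -> list:
    """Evaluate every unique pair of observer masks for one organ on one MRI,
    mirroring the pairwise interobserver-variability comparison in the abstract."""
    return [pairwise_similarity(m1, m2)
            for m1, m2 in combinations(observer_masks, 2)]
```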
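Likewise, the selection of the network's operating point at maximum MCC and the human-versus-computer comparison could be sketched as below. The reference mask is assumed to be a precomputed consensus of the human contours (the abstract uses STAPLE for this; a ready-made filter exists in toolkits such as SimpleITK, but it is not reproduced here). The helper names best_threshold_by_mcc and compare_human_vs_computer are hypothetical, and the Wilcoxon rank sum test uses scipy.stats.ranksums, which is two-sided by default.

```python
import numpy as np
from scipy import stats


def best_threshold_by_mcc(prob_map: np.ndarray,
                          reference_mask: np.ndarray,
                          thresholds=np.linspace(0.05, 0.95, 19)):
    """Sweep binarization thresholds over the network's probability map and
    return the threshold giving maximum MCC against the consensus reference."""
    best_t, best_mcc = None, -1.0
    for t in thresholds:
        mcc = pairwise_similarity(prob_map >= t, reference_mask)["MCC"]
        if mcc > best_mcc:
            best_t, best_mcc = float(t), mcc
    return best_t, best_mcc


def compare_human_vs_computer(human_mcc, computer_mcc):
    """Two-tailed Wilcoxon rank sum test between the human interobserver MCC
    values and the computer-versus-reference MCC values for one organ."""
    statistic, p_value = stats.ranksums(human_mcc, computer_mcc)
    return statistic, p_value
```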