Abstract 2101: Deep learning for automatic extraction of tumor site and histology from unstructured pathology reports

2020 
Introduction: Much of the information in electronic medical records (EMRs) required for the practice of clinical oncology is contained in unstructured text. While natural language processing (NLP) has been used to extract information from EMR text, accuracy is suboptimal. In late 2018, a powerful new deep-learning NLP algorithm was published: Bidirectional Encoder Representations from Transformers (BERT). BERT set new accuracy records and, for the first time, achieved human-level performance on several NLP benchmarks. Our goal was to train BERT to extract clinically relevant data from pathology reports with high accuracy.

Procedures: Like many cancer centers nationwide, Moffitt Cancer Center employs Certified Tumor Registrars (CTRs) to collect and report data about cancer patients to state and federal agencies. The CTR-extracted data are labels that identify, with high accuracy, important information in each pathology report. Consequently, we used these data to tune BERT to perform a question-answering (QA) [...] and the F1 statistic. The latter produces a value between 0% and 100% indicating the degree of overlap between words in the BERT-extracted data and words in the CTR-extracted data.

Results: The final dataset contained 14,143 pathology reports (11,520 for training, 2,623 for testing). This dataset included tumors from 228 organ sites involving 232 histological classifications. The three most common organ site / histological classification pairs were Prostate Gland / Adenocarcinoma (6.7%), Breast / Invasive Carcinoma (6.1%), and Breast Overlapping Lesion / Invasive Carcinoma (5.9%). Our BERT-based QA [...]

[...] 2020 Apr 27-28 and Jun 22-24. Philadelphia (PA): AACR; Cancer Res 2020;80(16 Suppl):Abstract nr 2101.
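The word-overlap F1 statistic described under Procedures can be illustrated with a minimal sketch. It assumes a SQuAD-style token-level F1 with lowercase, whitespace-based tokenization; the abstract does not state how words were normalized or matched, so the function name `token_f1` and the tokenization are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def token_f1(predicted: str, reference: str) -> float:
    """Word-overlap F1 between two strings, as a percentage (0 to 100).

    Hypothetical helper: `predicted` stands in for the BERT-extracted text
    and `reference` for the CTR-extracted label.
    """
    # Assumption: lowercase, whitespace tokenization. The abstract does not
    # specify the normalization that was actually used.
    pred_tokens = predicted.lower().split()
    ref_tokens = reference.lower().split()

    if not pred_tokens or not ref_tokens:
        # Two empty strings count as a perfect match; one empty string
        # against a non-empty one counts as no overlap.
        return 100.0 if pred_tokens == ref_tokens else 0.0

    # Multiset intersection: each shared word is counted at most as many
    # times as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)  # share of predicted words that match
    recall = overlap / len(ref_tokens)      # share of reference words recovered
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Under these assumptions, a predicted answer of "breast" scored against the label "breast overlapping lesion" has precision 1 and recall 1/3, which gives an F1 of 50%.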