ToKSA - Tokenized Key Sentence Annotation - a Novel Method for Rapid Approximation of Ground Truth for Natural Language Processing
2021
Objective
Identifying phenotypes and pathology from free text is an essential task for clinical work and research. Natural language processing (NLP) is a key tool for processing free text at scale. Developing and validating NLP models requires labelled data. Labels are generated through time-consuming and repetitive manual annotation and are hard to obtain for sensitive clinical data. The objective of this paper is to describe a novel approach for annotating radiology reports.
Materials and Methods
We implemented tokenized key sentence-specific annotation (ToKSA) for annotating clinical data. We demonstrate ToKSA using 180,050 abdominal ultrasound reports, with labels generated for symptom status, gallstone status and cholecystectomy status. First, individual sentences are grouped into a term-frequency matrix. Key (i.e. the most frequently occurring) sentences are then annotated, and these annotations are used to generate labels for multiple reports simultaneously. We compared ToKSA-derived labels to those generated by annotating full reports. We then used ToKSA-derived labels to train a convolutional neural network document classifier and compared its performance to that of a separate classifier trained on labels based on the full reports.
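The sentence-level idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`split_sentences`, `key_sentences`, `propagate_labels`), the naive full-stop sentence splitter, and the rule that a report is labelled only when every one of its sentences has been annotated (and is positive if any sentence is positive) are all assumptions made for illustration.

```python
import re
from collections import Counter


def split_sentences(report):
    """Naive splitter (assumption): break on full stops and newlines,
    lowercase, and collapse whitespace so identical sentences match."""
    parts = re.split(r"[.\n]+", report.lower())
    return [" ".join(p.split()) for p in parts if p.strip()]


def key_sentences(reports, top_k):
    """Return the top_k most frequent sentences, counted once per report."""
    counts = Counter()
    for report in reports:
        counts.update(set(split_sentences(report)))
    return [sentence for sentence, _ in counts.most_common(top_k)]


def propagate_labels(reports, sentence_labels):
    """Hypothetical aggregation rule: label a report only if all of its
    sentences were annotated; it is positive if any sentence is positive."""
    labels = {}
    for idx, report in enumerate(reports):
        sentences = split_sentences(report)
        if sentences and all(s in sentence_labels for s in sentences):
            labels[idx] = any(sentence_labels[s] for s in sentences)
    return labels


# Toy reports (invented for illustration, not taken from the study data).
reports = [
    "Gallstones seen. Normal liver.",
    "No gallstones. Normal liver.",
    "Gallstones seen. Dilated common bile duct.",
]

# Step 1: find the most frequent sentences to show to an annotator.
to_annotate = key_sentences(reports, top_k=2)

# Step 2: a human annotates sentences for gallstone status (True = present).
sentence_labels = {
    "gallstones seen": True,
    "normal liver": False,
    "no gallstones": False,
}

# Step 3: propagate sentence annotations to whole reports.
report_labels = propagate_labels(reports, sentence_labels)
```

In this toy example the third report stays unlabelled because one of its sentences was never annotated, which reflects why annotating a limited set of frequent sentences yields labels for only part of the corpus.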
Results
By annotating only 2,000 frequent sentences, we were able to generate labels for symptom status for 70,000 reports (accuracy 98.4%), gallstone status for 85,177 reports (accuracy 99.2%) and cholecystectomy status for 85,177 reports (accuracy 100%). The accuracy of the document classifier trained on ToKSA labels was similar (0.1-1.1% more accurate) to the document classifier trained on full report labels.
Conclusion
ToKSA offers an accurate and efficient method for annotating free text clinical data.