Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition
2020
In digital libraries, the accessibility of digitized documents depends directly on how they are indexed. Named entities are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version, and OCR errors may hinder their accessibility. This paper aims to quantitatively estimate the impact of OCR quality on the performance of named entity recognition (NER). We tested state-of-the-art NER techniques over several evaluation benchmarks, applying various levels and types of synthesised OCR noise to measure its effect on NER performance. We share all corresponding datasets. To the best of our knowledge, no other research work has systematically studied the impact of OCR on named entity recognition over datasets in multiple languages. The final outcome of this study is an evaluation over historical newspaper data of the National Library of Finland, resulting in an increase of around 11 percentage points in F1-measure over the best previously known results.
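To make the experimental idea concrete, the following is a minimal sketch of injecting character-level noise at a chosen rate and scoring entity predictions with F1. It is an illustration under stated assumptions, not the noise model or scoring script used in the paper: the function names `add_char_noise` and `entity_f1`, the uniform substitute/delete/insert operations, and the exact-match criterion are all hypothetical choices made for this example.

```python
import random

def add_char_noise(text, error_rate, seed=0):
    """Corrupt roughly `error_rate` of the non-space characters in `text`.

    Assumption: errors are uniform random substitutions, deletions, or
    insertions; real OCR noise is typically more structured than this.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        # Keep whitespace intact and leave most characters untouched.
        if ch.isspace() or rng.random() >= error_rate:
            out.append(ch)
            continue
        op = rng.choice(("substitute", "delete", "insert"))
        if op == "substitute":
            out.append(rng.choice(alphabet))
        elif op == "insert":
            out.append(ch)
            out.append(rng.choice(alphabet))
        # "delete": append nothing
    return "".join(out)

def entity_f1(gold, predicted):
    """Exact-match F1 over collections of (entity_text, entity_type) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In such a setup, one would run the same NER system on the clean text and on `add_char_noise(text, rate)` for several values of `rate`, then compare the resulting `entity_f1` scores against the gold annotations.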