OCR Improvements for Images of Multi-page Historical Documents
2021
This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline is able to handle images of various quality, whether they were obtained by a digital scanner or camera. The image can contain multiple pages in any layout, but an approximate upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system that has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56% which is an absolute gain of 19.19% from the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
Keywords:
- Correction
- Source
- Cite
- Save
- Machine Reading By IdeaReader
19
References
0
Citations
NaN
KQI