OCR Improvements for Images of Multi-page Historical Documents

Ivan Gruber, Marek Hrúz, Pavel Ircing, Petr Neduchal, Tomáš Zítka, Miroslav Hlaváč, Zbyněk Zajíc, Jan Švec, Martin Bulín

2021

This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for information retrieval. The pipeline handles images of varying quality, whether obtained by a digital scanner or a camera. An image can contain multiple pages in any layout, but an approximately upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. Each detected page is then processed by a deskew algorithm to correct its orientation and finally read by the Tesseract OCR system, which has been retrained on a large set of synthetic images and a small set of annotated real-world documents. Applying the pipeline increased word recall to 60.56%, an absolute gain of 19.19 percentage points over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
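The abstract reports word recall but does not define how it is computed. As a minimal sketch, assuming recall means the fraction of ground-truth words that also appear in the OCR output (case-insensitive, whitespace-tokenized, counted as a multiset), it could be written as follows. The function name `word_recall` and the matching rules are illustrative assumptions, not the authors' evaluation code. The last lines only work out the baseline recall implied by the two figures quoted in the abstract.

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference words recovered in the OCR hypothesis.

    Assumption: bag-of-words matching, lowercased, split on whitespace.
    The paper may use a different tokenization or alignment.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Multiset intersection: each reference word is matched at most
    # as many times as it occurs in the hypothesis.
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total


# Figures quoted in the abstract (percent).
reported_recall = 60.56
absolute_gain = 19.19
# Baseline recall implied by the abstract: reported value minus the gain.
implied_baseline = round(reported_recall - absolute_gain, 2)
```

For example, `word_recall("the quick brown fox", "the quik brown fox")` counts three of the four reference words as recovered. Under the reading above, the figures in the abstract imply a Tesseract-only baseline of about 41.37% word recall.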
