OCR Improvements for Images of Multi-page Historical Documents

2021 
This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline is able to handle images of various quality, whether they were obtained by a digital scanner or camera. The image can contain multiple pages in any layout, but an approximate upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system that has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56% which is an absolute gain of 19.19% from the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    0
    Citations
    NaN
    KQI
    []