Facilitating Access to Historical Documents by Improving Digitisation Results

2020 
Born-analog documents contain enormous knowledge that is valuable to our society. For the purposes of preservation and easy accessibility, several digitisation projects have converted these documents into digital text using optical character recognition (OCR) software. Some limitations of existing OCR techniques prevent users and downstream processes from accessing, searching, or retrieving information in these digitised collections, and so limit the benefits of these projects. A notable limitation is that meaningful structures such as chapters and sections are not available in OCRed books, which makes it inconvenient for users to navigate or search for information inside them. Another constraint is that the accuracy of modern OCR engines drops substantially on historical documents. Erroneous OCR output considerably degrades the performance of search engines and natural language processing systems. This thesis facilitates access to historical digitised documents by addressing these problems. Several approaches are proposed, aiming to reconstruct the logical structure of books and to improve the quality of digitised text. The first contribution is to rebuild the logical book structure: an ensemble method is introduced to extract the tables of contents of digitised books. Experimental results show that our approach outperforms the state of the art on both evaluation metrics. The major contribution of this thesis is to provide methodologies to reduce OCR errors. The similarities and differences between OCR errors and human misspellings are clarified to better inform the design of post-OCR processing. Normally, a post-processing system both detects and corrects the remaining errors. However, it is reasonable to treat detection and correction as separate tasks in applications that allow erroneous data to be filtered out, flagged, or selectively reprocessed.
In this thesis, we examine different post-OCR approaches: some based on error models and language models, and others on neural network models. Results show that the performance of our proposals is comparable to several strong baselines on the English datasets of the two competitions on post-OCR text correction organised at the International Conference on Document Analysis and Recognition in 2017 and 2019.