Improving Book OCR by Adaptive Language and Image Models

Dar-Shyang Lee, Raymond W. Smith

2012

To cope with the vast diversity of book content and typefaces, OCR systems must exploit the strong consistency within a book while adapting to variations across books. We describe a system that combines two parallel correction paths, one using a document-specific image model and the other a document-specific language model. Each model adapts to the shapes and vocabulary of a book and flags inconsistencies as correction hypotheses, but relies on the other model to cross-validate them. With the open source Tesseract engine as the baseline, results on a large data set of scanned books show that this approach reduces word error rates by 25 percent.
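
The abstract does not say how the word error rate is computed. As a minimal sketch, assuming the common definition (word-level edit distance divided by the number of reference words) and assuming the 25 percent figure is a relative reduction against the baseline, the metric could be evaluated as follows. The function names and the sample strings are illustrative and do not come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: this is the conventional WER definition, not necessarily
    the exact scoring procedure used by the authors.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction by which the improved WER is lower than the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical strings, for illustration only.
    truth = "the quick brown fox"
    ocr_output = "the qu1ck brown"
    print(word_error_rate(truth, ocr_output))
    print(relative_reduction(0.04, 0.03))
```

Under this reading, a baseline word error rate of 4 percent falling to 3 percent would count as a 25 percent reduction; these two rates are made-up numbers, not results from the paper.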

Keywords:

Optical character recognition
Strong consistency
Speech recognition
Error detection and correction
Typeface
Tesseract
Cross-validation
Word error rate
Language model
Computer science
Open source
Artificial intelligence
Information retrieval
Document image processing
Natural language processing

