Adapting the Tesseract open source OCR engine for multilingual OCR

Raymond W. Smith, Daria Antonova, Dar-Shyang Lee

2009

We describe efforts to adapt the Tesseract open source OCR engine for multiple scripts and languages. Effort has concentrated on enabling generic multilingual operation, such that a new language requires negligible customization beyond providing a corpus of text. Although changes were required to various modules, including physical layout analysis and linguistic post-processing, no change was required to the character classifier beyond changing a few limits. The Tesseract classifier has adapted easily to Simplified Chinese. Test results on English, a mixture of European languages, and Russian, taken from a random sample of books, show a reasonably consistent word error rate between 3.72% and 5.78%, and Simplified Chinese has a character error rate of only 3.77%.
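
The abstract reports word and character error rates without stating how they are scored. As a minimal sketch, assuming the common edit-distance definition (substitutions, insertions, and deletions divided by the length of the reference), the two metrics could be computed as below. The function names are hypothetical, and the paper's actual scoring procedure may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

A character-level rate is the natural choice for a script such as Simplified Chinese, where whitespace does not delimit words, which is consistent with the abstract reporting a character error rate for that language and word error rates for the others.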

Keywords:

Scripting language
Constructed language
Natural language processing
Classifier (linguistics)
Speech recognition
Tesseract
Word error rate
Personalization
Computer science
Artificial intelligence
Open source
