Building Data Sets for Indian Language OCR Research

C. V. Jawahar, Anand Kumar, A. Phaneendra, K.J. Jinesh

2009

The lack of annotated data sets has been one of the hurdles in developing robust document understanding systems for Indian languages. In this chapter, we present our efforts to build such resources. Our corpus consists of more than 600,000 document images in Indian scripts. A parallel text is aligned to the images to obtain word- and symbol-level annotated data sets. We describe the process we follow and the current status of the work.

Keywords:

Scripting language
Natural language processing
Data set
Data mining
Artificial intelligence
Computer science
Indian scripts
Indian language
