Building Data Sets for Indian Language OCR Research

2009 
The lack of annotated data sets has been one of the hurdles in developing robust document understanding systems for Indian languages. In this chapter, we present our efforts to address this gap. Our corpus consists of more than 600,000 document images in Indian scripts. Parallel text is aligned with the images to obtain word- and symbol-level annotated data sets. We describe the process we follow and the current status of these activities.