A Robust and Automated Approach for Multilingual Indian Document Indexing

2019 
Currently, several Indian government offices lack robust software for searching for words in scanned multilingual Indian documents. Searching such documents manually is tedious and time-consuming, and the number of documents to be searched for the desired content is large. There is therefore a pressing need for robust automatic search software for aged multilingual Indian documents, for which no single robust Optical Character Recognition (OCR) system exists that can recognize the complex Indian scripts. To this end, we propose a new geometrical approach for grouping the components that belong to a text line in a document with multiple orientations, together with an extended profile feature extraction technique for character recognition in printed Indian documents. The performance of the proposed approach is evaluated on a variety of Indian documents containing English characters and Devanagari script. Experimental results suggest that the proposed approach generates accurate index words for most of the document images used in this study. Moreover, the proposed technique saves both time and effort compared with manual indexing of document images.
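The abstract does not define the extended profile feature, so the following is only a minimal sketch of a basic profile feature as it is commonly used in character recognition: for each row or column of a binarized glyph, the distance from an edge of the bounding box to the first ink pixel. The names `Glyph`, `left_profile`, and `profile_features` are hypothetical and chosen for illustration; this is not the authors' implementation, and their extension may differ substantially.

```python
from typing import List

# Assumption: a glyph is a binarized character image, 1 = ink, 0 = background.
Glyph = List[List[int]]


def left_profile(glyph: Glyph) -> List[int]:
    """For each row, count the background pixels before the first ink pixel."""
    width = len(glyph[0])
    profile = []
    for row in glyph:
        # A row with no ink gets the full width as its distance.
        distance = row.index(1) if 1 in row else width
        profile.append(distance)
    return profile


def profile_features(glyph: Glyph) -> List[int]:
    """Concatenate the left, right, top, and bottom profiles into one vector."""
    mirrored = [row[::-1] for row in glyph]            # scan rows from the right
    transposed = [list(col) for col in zip(*glyph)]    # scan columns from the top
    flipped = [col[::-1] for col in transposed]        # scan columns from the bottom
    return (
        left_profile(glyph)
        + left_profile(mirrored)
        + left_profile(transposed)
        + left_profile(flipped)
    )
```

A vector of this kind could be compared against stored templates or passed to a classifier. In practice the glyph would first be normalized to a fixed size so that vectors from different characters have the same length.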