Targeted Optical Character Recognition: Classification Using Capsule Network.

2019 
Optical Character Recognition (OCR) is the process of converting an image or document containing text into a machine-readable format. In this paper, we target only the numeric content of tables, along with a few special characters. Many firms that deal in financial information want to parse data from scanned tables, and in some cases they do not need the row labels because these rarely change. Focusing only on numeric information can also give language independence to firms that handle documents written in a variety of languages: foreign-language experts need only read the row labels while the OCR extracts the numeric data, which speeds up data collection. We developed a targeted OCR that saves time by processing only the important characters and that can also overcome erroneous predictions caused by under-segmentation of characters. We propose a novel approach that segments the document into blocks of text (each line or word forming one block) and classifies each block as numeric or non-numeric using a binary CNN. Character-level segmentation and classification using capsule networks is then applied only to the blocks that the binary CNN classifies as numeric.
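
The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `targeted_ocr`, `segment_columns`, `is_numeric`, and `classify_char` are hypothetical names, the two classifiers are passed in as plain callables standing in for the binary CNN and the capsule network, and the character segmentation uses a simple vertical-projection split, which is not necessarily the method used in the paper.

```python
from typing import Callable, List, Optional

# Hypothetical image type: a binarized block as rows of 0/1 pixels (1 = ink).
Image = List[List[int]]


def segment_columns(block: Image) -> List[Image]:
    """Split a block into character crops at ink-free columns.

    Assumption: vertical-projection splitting stands in for the paper's
    character-level segmentation, whose details the abstract does not give.
    """
    if not block:
        return []
    width = len(block[0])
    # True for every column that contains at least one ink pixel.
    ink = [any(row[x] for row in block) for x in range(width)]
    crops: List[Image] = []
    start: Optional[int] = None
    # The trailing False closes a character that touches the right edge.
    for x, has_ink in enumerate(ink + [False]):
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            crops.append([row[start:x] for row in block])
            start = None
    return crops


def targeted_ocr(
    blocks: List[Image],
    is_numeric: Callable[[Image], bool],   # stand-in for the binary CNN
    classify_char: Callable[[Image], str],  # stand-in for the capsule network
) -> List[Optional[str]]:
    """Return recognized text for numeric blocks and None for skipped blocks."""
    results: List[Optional[str]] = []
    for block in blocks:
        if not is_numeric(block):
            # Non-numeric blocks never reach character-level processing.
            results.append(None)
            continue
        chars = segment_columns(block)
        results.append("".join(classify_char(c) for c in chars))
    return results
```

The point of the structure is that the cheaper block-level decision gates the more expensive character-level stage, so row labels and other non-numeric text are skipped entirely.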