Integrating natural language processing with image document analysis: what we learned from two real-world applications
2015
Automatically accessing information in unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from the Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. This approach significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output of OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
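To make the reranking idea concrete: discriminative reranking scores each candidate search hit by combining features from several resources and reorders the candidate list accordingly. The sketch below is a minimal illustration of that scheme, not the paper's actual model; the feature names (recognizer confidence, linguistic plausibility, layout consistency) and weights are hypothetical stand-ins for the kinds of features the abstract mentions.

```python
# Minimal sketch of discriminative reranking for OOV name-search hits.
# Feature names and weights below are illustrative assumptions, not
# taken from the paper's implementation.

def rerank(hits, weights):
    """Score each candidate by a weighted sum of its features and
    return the hits sorted from most to least likely true match."""
    def score(hit):
        return sum(weights[f] * v for f, v in hit["features"].items())
    return sorted(hits, key=score, reverse=True)

# Toy candidates with hypothetical features: raw OCR confidence,
# a linguistic-analysis score, and a layout-consistency score.
hits = [
    {"id": "h1", "features": {"ocr_conf": 0.9, "ling": 0.2, "layout": 0.1}},
    {"id": "h2", "features": {"ocr_conf": 0.4, "ling": 0.9, "layout": 0.8}},
]
weights = {"ocr_conf": 1.0, "ling": 2.0, "layout": 1.0}

ranked = rerank(hits, weights)
print([h["id"] for h in ranked])  # h2 outranks h1: 0.4+1.8+0.8 > 0.9+0.4+0.1
```

In a trained system the weights would be learned discriminatively from labeled hits rather than set by hand; the key point the abstract makes is that adding features beyond raw recognizer confidence (here, the hypothetical `ling` and `layout` scores) can demote high-confidence false alarms.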
Keywords:
- Keyword spotting
- Computer science
- Machine learning
- Machine translation
- Pattern recognition
- Handwriting
- Optical character recognition
- Artificial intelligence
- Information retrieval
- Conditional random field
- Information technology
- Natural language processing
- Sentence
- Vocabulary
- False alarm rate
- Discriminative model
- Speech recognition