Integrating natural language processing with image document analysis: what we learned from two real-world applications
2015
Automatically accessing information in unconstrained image documents has important applications in business and government operations. These real-world applications typically combine optical character recognition (OCR) with language and information technologies, such as machine translation (MT) and keyword spotting. OCR output contains errors and presents unique challenges to late-stage processing. This paper addresses two of these challenges: (1) translating the output of Arabic handwriting OCR, which lacks reliable sentence boundary markers, and (2) searching for named entities that do not exist in the OCR vocabulary and are therefore completely missing from the Arabic handwriting OCR output. We address these challenges by leveraging natural language processing technologies, specifically conditional random field-based sentence boundary detection and out-of-vocabulary (OOV) name detection. This approach significantly improves our state-of-the-art MT system and achieves MT scores close to those achieved with human segmentation. The output of OOV name detection was used as a novel feature for discriminative reranking, which significantly reduced the false alarm rate of OOV name search on OCR output. Our experiments also show substantial performance gains from integrating features from multiple resources, such as linguistic analysis, image layout analysis, and image text recognition.
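To make the reranking idea concrete: discriminative reranking scores each candidate search hit by combining features from several resources and reorders the candidate list accordingly. The sketch below is a minimal illustration of that scheme, not the paper's actual model; the feature names (recognizer confidence, linguistic plausibility, layout consistency) and weights are hypothetical stand-ins for the kinds of features the abstract mentions.

```python
# Minimal sketch of discriminative reranking for OOV name-search hits.
# Feature names and weights below are illustrative assumptions, not
# taken from the paper's implementation.

def rerank(hits, weights):
    """Score each candidate by a weighted sum of its features and
    return the hits sorted from most to least likely true match."""
    def score(hit):
        return sum(weights[f] * v for f, v in hit["features"].items())
    return sorted(hits, key=score, reverse=True)

# Toy candidates with hypothetical features: raw OCR confidence,
# a linguistic-analysis score, and a layout-consistency score.
hits = [
    {"id": "h1", "features": {"ocr_conf": 0.9, "ling": 0.2, "layout": 0.1}},
    {"id": "h2", "features": {"ocr_conf": 0.4, "ling": 0.9, "layout": 0.8}},
]
weights = {"ocr_conf": 1.0, "ling": 2.0, "layout": 1.0}

ranked = rerank(hits, weights)
print([h["id"] for h in ranked])  # h2 outranks h1: 0.4+1.8+0.8 > 0.9+0.4+0.1
```

In a trained system the weights would be learned discriminatively from labeled hits rather than set by hand; the key point the abstract makes is that adding features beyond raw recognizer confidence (here, the hypothetical `ling` and `layout` scores) can demote high-confidence false alarms.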
Keywords:
- Keyword spotting
- Computer science
- Machine learning
- Machine translation
- Pattern recognition
- Handwriting
- Optical character recognition
- Artificial intelligence
- Information retrieval
- Conditional random field
- Information technology
- Natural language processing
- Sentence
- Vocabulary
- False alarm rate
- Discriminative model
- Speech recognition