Automatic Tracing and Extraction of Text-Line and Word Segments Directly in JPEG Compressed Document Images

2020 
JPEG is one of the most popular and efficient compression algorithms supported in consumer electronics. Heavy use of mobile phones and e-governance applications has resulted in a huge collection of JPEG compressed document images. The major challenge with these images is that processing them is expensive, because it requires repeated decompression and recompression. Recent work has shown that developing algorithms that operate directly on the compressed data is one way to overcome this issue. This study investigates a novel algorithm for segmenting text-lines and words directly from JPEG compressed handwritten document images. Segmenting a handwritten document is challenging because of uneven spacing, variable font sizes, and overlapping and touching components, and it becomes much more challenging when done directly on the compressed image. The proposed technique virtually fixes a vertical stripe at the beginning of the document to detect the starting points of text-lines. A moving window-based space penetration algorithm then traces the exact boundary between two text-lines, resolving the issues of space and font variations and of touching and overlapping components. Subsequently, a word boundary tracing algorithm is used to segment words.
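The vertical-stripe step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the per-block DC coefficients have already been read from the JPEG stream into a 2D grid (one value per 8x8 block), that dark ink on a light page gives lower DC values, and that a fixed threshold separates ink from background. The function name, the parameters `stripe_width` and `ink_threshold`, and the sample grid are all hypothetical.

```python
from typing import List, Tuple


def line_starts_in_stripe(
    dc: List[List[float]], stripe_width: int, ink_threshold: float
) -> List[Tuple[int, int]]:
    """Find runs of block-rows that contain ink inside the left stripe.

    dc[r][c] is the (assumed) DC coefficient of the 8x8 block at
    block-row r and block-column c. Returns inclusive (first, last)
    block-row ranges, one per candidate text-line start.
    """
    runs: List[Tuple[int, int]] = []
    start = None  # first block-row of the run currently being tracked
    for r, row in enumerate(dc):
        # A block-row has ink if any block in the stripe is dark enough.
        has_ink = any(v < ink_threshold for v in row[:stripe_width])
        if has_ink and start is None:
            start = r
        elif not has_ink and start is not None:
            runs.append((start, r - 1))
            start = None
    if start is not None:  # a run that reaches the bottom of the page
        runs.append((start, len(dc) - 1))
    return runs


# Toy grid: 7 block-rows by 4 block-columns, light background around 200.
page = [
    [200, 200, 200, 200],
    [50, 200, 200, 200],
    [60, 40, 200, 200],
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [200, 30, 200, 200],
    [200, 200, 200, 200],
]
starts = line_starts_in_stripe(page, stripe_width=2, ink_threshold=128)
print(starts)  # [(1, 2), (5, 5)]
```

Each returned range marks where a text-line enters the stripe. Tracing the boundary between adjacent lines across the rest of the page, which the abstract assigns to the moving window-based space penetration algorithm, is not covered by this sketch.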