Segmentation and labeling of documents using conditional random fields

2007 
ABSTRACT The paper describes the use of Conditional Random Fields (CRF) utilizing contextual information in automatically labeling extracted segments of scanned documents as Machine-print, Handwriting and Noise. The result of such a labeling can serve as an indexing step for a context-based image retrieval system or a bio-metric signature verification system. A simple region growing algorithm is first used to segment the document into a number of patches. A label for each such segmented patch is inferred using a CRF model. The model is flexible enough to include signatures as a type of handwriting and isolate it from machine-print and noise. The robustness of the model is due to the inherent nature of modeling neighboring spatial dependencies in the labels as well as the observed data using CRF. Maximum pseudo-likelihood estimates for the parameters of the CRF model are learnt using conjugate gradient descent. Inference of labels is done by computing the probability of the labels under the model with Gibbs sampling. Experimental results show that this approach provides for 95.75% of the data being assigned correct labels. The CRF based model is shown to be superior to Neural Networks and Naive Bayes.
Keywords: Conditional Random Field (CRF); labeling scanned documents; handwritten text extraction
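The segmentation step can be illustrated with a minimal sketch. The abstract does not specify the region growing criterion, so the version below is an assumption: it treats the page as a binary image (1 for ink, 0 for background) and grows each patch by 4-connected flood fill. The function name `grow_regions` and the input layout are hypothetical and are not taken from the paper.

```python
from collections import deque


def grow_regions(image):
    """Group 4-connected foreground pixels of a binary image into patches.

    `image` is a list of equal-length rows holding 0 (background) or 1 (ink).
    Returns (labels, count): `labels` has the same shape as `image`, with 0
    for background and 1..count for the patches in scan order.
    Assumption: plain connectivity stands in for the paper's growing rule.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or labels[r][c]:
                continue  # background, or already assigned to a patch
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first growth from the seed pixel.
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count
```

Each resulting patch would then be one site of the CRF, whose label (Machine-print, Handwriting or Noise) is inferred jointly with the labels of its spatial neighbors.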