Feature Selection for Document Flow Segmentation

2018 
In this paper, we describe a method to restore the individual documents in a continuous flow. The flow is a collection of consecutive scanned pages with no explicit separation marks between documents. Our method relies on contextual and layout descriptors that characterize the relationship between each pair of consecutive pages. Each relationship is represented as a feature vector of boolean values indicating the presence or absence of the descriptors on the two pages concerned. The segmentation task therefore consists of classifying these vectors as continuities or breaks. The continuity class indicates that both pages belong to the same document, while the break class ends the ongoing document and starts a new one. The experiments are conducted on a large collection of real administrative documents.
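The pairwise formulation can be illustrated with a minimal Python sketch. It is not the method evaluated in the paper: the page fields (`page_number`, `has_letterhead`, `has_signature`, `sender`), the four descriptors, and the rule-based `toy_classifier` are all hypothetical placeholders, chosen only to show how boolean vectors for consecutive page pairs can be mapped to continuities or breaks and then turned into documents.

```python
from typing import Any, Callable, Dict, List, Sequence

Page = Dict[str, Any]  # hypothetical page record produced by some upstream analysis


def pair_features(prev: Page, curr: Page) -> List[bool]:
    """Boolean descriptors for one pair of consecutive pages (all hypothetical)."""
    return [
        # Page numbering continues from the first page to the second.
        prev.get("page_number") is not None
        and curr.get("page_number") == prev["page_number"] + 1,
        # A letterhead is present on the second page.
        bool(curr.get("has_letterhead", False)),
        # A signature is present on the first page.
        bool(prev.get("has_signature", False)),
        # Both pages name the same sender.
        prev.get("sender") is not None and prev.get("sender") == curr.get("sender"),
    ]


def toy_classifier(features: Sequence[bool]) -> bool:
    """Placeholder rule standing in for a trained classifier.

    Returns True for a break and False for a continuity.
    """
    numbering_continues, letterhead, signature, same_sender = features
    if numbering_continues:
        return False
    return letterhead or signature or not same_sender


def segment_flow(
    pages: Sequence[Page], is_break: Callable[[Sequence[bool]], bool]
) -> List[List[Page]]:
    """Group a flow of pages into documents using pairwise decisions."""
    if not pages:
        return []
    documents: List[List[Page]] = [[pages[0]]]
    for prev, curr in zip(pages, pages[1:]):
        if is_break(pair_features(prev, curr)):
            documents.append([curr])    # break: close the document, start a new one
        else:
            documents[-1].append(curr)  # continuity: same document
    return documents
```

Any classifier that maps a boolean vector to a break or continuity decision can be passed as `is_break`; a flow of n pages yields n - 1 vectors to classify.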