Combination of Structural and Factual Descriptors for Document Stream Segmentation

2016 
This paper extends the previous work of [4]. Having no information about where documents are separated in the flow, the system operates progressively, examining successive pairs of pages in search of continuity or rupture descriptors. Four document levels are introduced to extract those descriptors more reliably and to reduce ambiguity in their extraction: records, technical documents, fundamental documents and cases. At each level, structural and factual descriptors are first extracted and then compared between pairs of pages or documents. To strengthen the descriptors and focus the system on equivalent descriptors within a pair, each descriptor is accompanied by its context. Extracting this context is made easier by determining the physical and logical structure of the pages. Contextual rules based on these descriptors are then used to decide between continuity, rupture or uncertainty for each pair. To cope with a current page that carries little or no information, a logbook gathers the descriptors of all previous pages of the record, and a buffer allows the comparison to be delayed. These last two mechanisms, added to the previous work, considerably strengthen the current system, increasing its precision by more than 6%.
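The progressive comparison described above can be illustrated with a minimal sketch. It is not the authors' implementation: the representation of a page as a dictionary of descriptors, the names `compare` and `segment`, and the equality-based rule are all assumptions made for illustration. In particular, the abstract does not say what happens to buffered pages, so the sketch assumes they are attached to the document that was open when they were buffered.

```python
from typing import Dict, List

# Hypothetical representation: descriptor name -> extracted value.
Page = Dict[str, str]


def compare(logbook: Page, page: Page) -> str:
    """Return 'continuity', 'rupture' or 'uncertainty' for one page."""
    shared = set(logbook) & set(page)
    if not shared:
        return "uncertainty"  # nothing comparable on this page
    if all(logbook[key] == page[key] for key in shared):
        return "continuity"
    return "rupture"


def segment(pages: List[Page]) -> List[List[int]]:
    """Group page indices into documents, page by page."""
    documents: List[List[int]] = []
    current: List[int] = []   # indices of the document being built
    logbook: Page = {}        # descriptors seen so far in that document
    buffer: List[int] = []    # pages whose decision is delayed
    for index, page in enumerate(pages):
        if not current:
            current, logbook = [index], dict(page)
            continue
        decision = compare(logbook, page)
        if decision == "uncertainty":
            buffer.append(index)  # delay the decision
            continue
        # Assumption: buffered pages stay with the document that was open.
        current.extend(buffer)
        buffer = []
        if decision == "continuity":
            current.append(index)
            logbook.update(page)  # enrich the logbook with new descriptors
        else:
            documents.append(current)
            current, logbook = [index], dict(page)
    if current or buffer:
        documents.append(current + buffer)
    return documents


if __name__ == "__main__":
    stream = [
        {"invoice_no": "A1", "client": "Dupont"},
        {},                          # page with no usable descriptor
        {"invoice_no": "A1"},
        {"invoice_no": "B7", "client": "Martin"},
        {"client": "Martin"},
    ]
    print(segment(stream))  # [[0, 1, 2], [3, 4]]
```

Here the empty second page is held in the buffer until the third page confirms continuity through the logbook, which is the role the abstract gives to these two mechanisms. Pages judged uncertain do not update the logbook in this sketch; a real system would apply contextual rules at each of the four document levels rather than simple equality.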