Semi-structured document extraction based on document element block model

2016 
Enterprises and institutions continually produce large numbers of documents related to their specific business in their daily work. To extract useful information from these semi-structured documents, we propose the document element block model (DEBM) and apply it to semi-structured document extraction. The model makes full use of the information contained in a document: not only its structural information but also its content. DEBM extracts document element blocks from template documents and target documents, then generates corresponding regular expression rules based on the characteristics of the template document's element blocks. After that, for each type of document element, it extracts the elements from the set of target blocks at the corresponding element block position by regular expression matching. Experiments show that extraction based on DEBM achieves good results and performs better than traditional regular expressions and template matching. The results indicate that DEBM is a simple, efficient model for extracting information from semi-structured documents.
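The pipeline described in the abstract (split documents into blocks, derive regular expression rules from template blocks, then match target blocks at the corresponding positions) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that blocks are separated by blank lines, that each template block has the form "Label: placeholder", and that a rule is built from the label alone. The names split_blocks, rule_from_template_block, and extract are hypothetical.

```python
import re


def split_blocks(text):
    """Split a document into blocks (assumption: blank lines mark block boundaries)."""
    return [b.strip() for b in re.split(r"\n\s*\n", text.strip()) if b.strip()]


def rule_from_template_block(block):
    """Build a regex rule from a template block of the assumed form 'Label: placeholder'."""
    label, sep, _ = block.partition(":")
    if not sep:
        return None  # block does not follow the assumed form, so no rule is generated
    pattern = r"^\s*" + re.escape(label.strip()) + r"\s*:\s*(?P<value>.+?)\s*$"
    return label.strip(), re.compile(pattern, re.DOTALL)


def extract(template_text, target_text):
    """Match each target block against the rule of the template block at the same position."""
    template_blocks = split_blocks(template_text)
    target_blocks = split_blocks(target_text)
    result = {}
    for position, template_block in enumerate(template_blocks):
        rule = rule_from_template_block(template_block)
        if rule is None or position >= len(target_blocks):
            continue
        label, regex = rule
        match = regex.match(target_blocks[position])
        if match:
            result[label] = match.group("value")
    return result


# Hypothetical template and target documents, for illustration only.
template = "Title: <title>\n\nDate: <date>\n\nAmount: <amount>"
target = "Title: Purchase contract\n\nDate: 2016-03-01\n\nAmount: 12,000 USD"
fields = extract(template, target)
```

Pairing blocks by position keeps the sketch short. A block whose label does not match the rule at its position is simply skipped, so a real system would need a more tolerant alignment between template and target blocks.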