Document transformation system from papers to XML data based on pivot XML document method

2003 
This paper proposes a new method for document transformation using OCR to generate various XML documents from printed documents. The proposed method adopts a hierarchical transformation strategy based on a pivot XML document. Firstly, document elements such as title, authors, abstract, headings, paragraphs, lists, captions, tables and figures are extracted from document images. Secondly, the hierarchical structure of document elements is extracted and is described using a DOM tree. Thirdly, this document structure is converted into a pivot XML document described as an XHTML document by an XML parser. Finally, this pivot XML document is transformed into the target XML document by the XML parser with XSLT scripts or specific programs. Experimental results show the method is effective in transforming printed documents to various XML documents.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    0
    References
    0
    Citations
    NaN
    KQI
    []