RE-STRUCTURING HTML DOCUMENTS STRUCTURE AUTOMATICALLY THROUGH CLUSTERING

2009 
In this paper we present a novel approach to automatically re-structuring HTML documents by extracting semantic structures from their header and body. The body of a web page is generally generated by software from a template, and its layout follows a physical schema. Our approach extracts trees based on the hierarchical relations in HTML documents. For this task we use two algorithms: the first is a header extraction algorithm, which extracts header trees from the head of an HTML document, and the second automatically partitions the body of a web page into tree-like semantic structures. We then use an application called layout changer, which changes the layout of one web page to that of another by aligning the extracted header trees and partition trees.
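To make the idea of a header tree concrete, the following is a minimal sketch and not the paper's header extraction algorithm, whose details the abstract does not give. It assumes that a header tree simply mirrors the nesting of elements inside `<head>`. The names `Node`, `HeadTreeBuilder`, and `extract_head_tree` are hypothetical and chosen only for illustration; parsing relies on `HTMLParser` from Python's standard library.

```python
from html.parser import HTMLParser

# Void elements never receive a closing tag, so they are recorded as leaves
# and never pushed onto the open-element stack.
VOID_TAGS = {"meta", "link", "base", "br", "hr", "img", "input"}


class Node:
    """One element of the (hypothetical) header tree."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.children = []


class HeadTreeBuilder(HTMLParser):
    """Builds a tree from the elements nested inside <head>.

    Assumption: the tree follows element nesting only; the paper's
    algorithm may use different or additional criteria.
    """

    def __init__(self):
        super().__init__()
        self.root = None
        self.stack = []  # currently open elements, root first

    def handle_starttag(self, tag, attrs):
        if tag == "head" and self.root is None:
            self.root = Node("head", attrs)
            self.stack = [self.root]
            return
        if not self.stack:
            return  # outside <head>: ignore
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag name, together with
        # anything left open inside it. Closing <head> empties the stack.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and data.strip():
            self.stack[-1].text += data.strip()


def extract_head_tree(html):
    """Return the root Node for <head>, or None if the page has no head."""
    builder = HeadTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


if __name__ == "__main__":
    page = (
        "<html><head><title>Demo</title><meta charset='utf-8'>"
        "<link rel='stylesheet' href='a.css'></head>"
        "<body><p>ignored</p></body></html>"
    )
    tree = extract_head_tree(page)
    for child in tree.children:
        print(child.tag, child.attrs, child.text)
```

A body partition tree could be built along the same lines, but the grouping criteria for the body (the clustering named in the title) are not described in the abstract, so they are left out of this sketch.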