RE-STRUCTURING HTML DOCUMENTS STRUCTURE AUTOMATICALLY THROUGH CLUSTERING

Sarwar Hadi,S. Qamar Abbas,Sheenu Rizvi

2009

In this paper we present a novel approach to automatically re-structuring HTML documents by extracting semantic structures from their header and body. The body of a web page is generally generated by software from a template, and its layout follows a physical schema. Our approach extracts trees based on the hierarchical relations in HTML documents. For this task we use two algorithms: the first is a header extraction algorithm, which extracts header trees from the head of an HTML document; the second automatically partitions the body of a web page into tree-like semantic structures. We then use an application called layout changer, which changes the layout of one web page to that of another by aligning the extracted header trees and partition trees.
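
As a rough illustration of the tree extraction step described above, the following minimal sketch builds one tree from the head and one from the body of an HTML document using Python's standard `html.parser` module. It is not the authors' header extraction or partitioning algorithm, which the abstract does not specify; the names `Node`, `TreeBuilder`, `extract_trees`, and `outline` are hypothetical, and the trees here simply mirror tag nesting rather than any semantic clustering.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"meta", "link", "br", "img", "hr", "input", "base",
             "area", "col", "embed", "source", "track", "wbr"}


class Node:
    """One tree node: a tag, its attributes, its direct text, its children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.text = ""
        self.children = []


class TreeBuilder(HTMLParser):
    """Hypothetical helper: turns tag nesting into a Node tree."""

    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].text += data.strip()


def find(node, tag):
    """Depth-first search for the first node with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def extract_trees(html):
    """Return (head_tree, body_tree); either may be None if absent."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return find(builder.root, "head"), find(builder.root, "body")


def outline(node, depth=0):
    """Render a tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines
```

Calling `extract_trees(page_source)` and passing each result to `outline` gives an indented view of the two trees. Aligning such trees between two pages, as the layout changer is described as doing, would need a matching step that this sketch does not attempt.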

Keywords:

Physical schema
Software
Header
Web page
Document Structure Description
Information retrieval
Cluster analysis
Static web page
Computer science
Text mining
Extraction algorithm
Structuring
