Sampling, information extraction and summarisation of hidden web databases

Yih-Ling Hedley, Muhammad Younas, Anne E. James, Mark Sanderson

2006

Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique that detects and extracts query-related information from documents contained in databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates from sampled documents and extracts relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy.
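
The second phase described above can be illustrated with a minimal Python sketch. It is not the authors' 2PS algorithm, whose details the abstract does not give. It assumes the first phase has already produced a list of sampled documents as plain text, and it uses a deliberately simple stand-in for template detection: any line that appears in at least a given share of the sampled documents is treated as template text. The function names, the `min_share` threshold and the sample documents are all hypothetical.

```python
import re
from collections import Counter


def detect_template_lines(documents, min_share=0.8):
    """Return the lines that occur in at least min_share of the documents.

    Assumption: text repeated across most sampled pages comes from the
    page template rather than from the query-related content.
    """
    line_doc_counts = Counter()
    for doc in documents:
        # Count each distinct non-empty line once per document.
        lines = {line.strip() for line in doc.splitlines() if line.strip()}
        line_doc_counts.update(lines)
    threshold = min_share * len(documents)
    return {line for line, n in line_doc_counts.items() if n >= threshold}


def summarise(documents, min_share=0.8):
    """Build a term-frequency summary from the non-template lines."""
    template = detect_template_lines(documents, min_share)
    term_freq = Counter()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()
            if line and line not in template:
                term_freq.update(re.findall(r"[a-z0-9]+", line.lower()))
    return term_freq


# Hypothetical sampled documents standing in for the output of phase one.
sampled = [
    "Home | Search | Help\nAspirin relieves mild pain\nCopyright Example Site",
    "Home | Search | Help\nInsulin regulates blood glucose\nCopyright Example Site",
    "Home | Search | Help\nAspirin may thin blood\nCopyright Example Site",
]
summary = summarise(sampled)
```

In this sketch the navigation and copyright lines are shared by every sampled page, so they are dropped before counting, and the summary contains only terms from the document-specific lines. A line-level comparison is the simplest possible choice; a real template detector would more plausibly compare markup structure across pages.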

Keywords:

Web page
Information retrieval
Database
Data mining
Data Web
Systems design
Computer science
Information extraction
Deep Web
Static web page
Sampling (statistics)
World Wide Web
Relevant information