Retrieving and organizing web pages by “information unit”

Wen-Syan Li, K. Selçuk Candan, Quoc Vu, Divyakant Agrawal

2001

Since the WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of an information unit, which can be viewed as a logical Web document consisting of multiple physical pages treated as one atomic retrieval unit. We present an algorithm to efficiently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.
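To make the idea of an information unit concrete, the following is a minimal illustrative sketch and not the algorithm presented in the paper. It assumes a toy model in which each page carries a set of terms, links are treated as undirected, and an information unit is a smallest connected set of pages that jointly contains all query terms. The function names, the `max_size` bound, and the sample pages are hypothetical; the brute-force enumeration is neither progressive nor efficient and ignores semantic similarity scores.

```python
from itertools import combinations


def is_connected(pages, adj):
    """Return True if `pages` form a single connected group under `adj`."""
    pages = set(pages)
    start = next(iter(pages))
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nbr in adj.get(node, ()):
            if nbr in pages and nbr not in seen:
                seen.add(nbr)
                stack.append(nbr)
    return seen == pages


def information_units(query, page_terms, links, max_size=3):
    """Brute-force search for the smallest connected page sets covering `query`.

    Assumption (for illustration only): hyperlinks are treated as undirected.
    """
    query = set(query)
    adj = {}
    for src, dsts in links.items():
        for dst in dsts:
            adj.setdefault(src, set()).add(dst)
            adj.setdefault(dst, set()).add(src)

    pages = sorted(page_terms)
    for size in range(1, max_size + 1):
        found = []
        for combo in combinations(pages, size):
            covered = set().union(*(page_terms[p] for p in combo))
            if query <= covered and is_connected(combo, adj):
                found.append(combo)
        if found:  # stop at the smallest size that yields any unit
            return found
    return []


# Hypothetical toy site: a main page linking to two detail pages,
# plus one unrelated page that is not linked to the others.
page_terms = {
    "index.html": {"hotel"},
    "rates.html": {"price"},
    "map.html": {"location"},
    "other.html": {"hotel", "weather"},
}
links = {"index.html": ["rates.html", "map.html"]}

units = information_units(["hotel", "price"], page_terms, links)
print(units)
```

In this toy setting, no single page contains both "hotel" and "price", so a search engine that returns only physical pages would have no complete match, whereas the linked pair of the main page and its rates page can be returned together as one unit.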
