Retrieving and organizing web pages by “information unit”

2001 
Since WWW encourages hypertext and hypermedia document authoring (e.g., HTML or XML), Web authors tend to create documents that are composed of multiple pages connected with hyperlinks or frames. A Web document may be authored in multiple ways, such as (1) all information in one physical page, or (2) a main page and the related information in separate linked pages. Existing Web search engines, however, return only physical pages. In this paper, we introduce and describe the use of the concept of information unit, which can be viewed as a logical Web document consisting of multiple physical pages as one atomic retrieval unit. We present an algorithm to eAEciently retrieve information units. Our algorithm can perform progressive query processing over a Web index by considering both document semantic similarity and link structures. Experimental results on synthetic graphs and real Web data show the effectiveness and usefulness of the proposed information unit retrieval technique.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    110
    Citations
    NaN
    KQI
    []