Generating vector spaces on-the-fly for flexible XML retrieval

Torsten Grabs, Hans-Jörg Schek

2002

In conventional information retrieval, documents are flat, i.e., they are treated as unstructured information. This is no longer adequate for semistructured data such as XML, for two reasons. First, XML allows information within a document to be structured hierarchically, so that each document has a tree structure. Users in turn want to refer to this structure when searching for relevant information. To do so, they pose so-called content-and-structure queries. Such queries refer to the document structure, e.g., by restricting the context of interest to some XML elements. Relevance ranking consequently has to reflect both the document structure and the constraints that the query places on that structure. In particular, content at different levels of the tree is considered to be of different importance for a query. The intuition is that content that is more distant from the context node in the document tree is less important than content that is close to it. We subsequently refer to this concept as nested retrieval; it is a crucial requirement for meaningful retrieval from XML documents.

Fuhr et al. tackle this issue with a technique called augmentation [2, 3]. The idea is to introduce so-called augmentation weights that downweight statistics such as inverse document frequencies of terms when the terms are propagated upwards in the document tree. To do so, Fuhr et al. [3] group XML element types into so-called indexing nodes that implement the inverted lists for efficient retrieval. Indexing nodes constitute the granularity of retrieval in their approach, i.e., indexes and statistics such as document frequencies are derived separately per indexing node. Users can search at the granularity of the indexing nodes and of hierarchical combinations of them, provided the indexing nodes lie along the same path in the document. Term weights are properly augmented in this case. The drawback of the approach is that the assignment of XML element types to indexing nodes is static.
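
The downweighting idea behind augmentation can be illustrated with a minimal sketch. It is not the formulation of Fuhr et al.: the function name `augmented_term_weights`, the single constant `augmentation` factor, and the use of raw term counts in place of statistics such as document frequencies are all assumptions made here for illustration. Each term counts fully in the element that contains it and is multiplied by the augmentation factor once per level when it is propagated upwards.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def augmented_term_weights(element, augmentation=0.5):
    """Return term weights for `element` (hypothetical helper).

    Terms in the element's own text count with weight 1.0. Weights
    coming from child elements are multiplied by `augmentation`
    (an assumed constant) for every level they move up the tree.
    """
    weights = Counter()

    # Collect text that belongs directly to this element: its leading
    # text plus the tail text that follows each child element.
    own_text = element.text or ""
    for child in element:
        own_text += " " + (child.tail or "")
    for term in own_text.lower().split():
        weights[term] += 1.0

    # Propagate child weights upwards, downweighted once per level.
    for child in element:
        for term, weight in augmented_term_weights(child, augmentation).items():
            weights[term] += augmentation * weight
    return weights


doc = ET.fromstring(
    "<article><title>xml retrieval</title>"
    "<section><para>xml ranking</para></section></article>"
)
weights = augmented_term_weights(doc, augmentation=0.5)
for term, weight in sorted(weights.items()):
    print(term, weight)
```

In this toy document, "ranking" sits two levels below the `article` element and therefore contributes less at the root than "retrieval", which sits one level below it. This matches the intuition that more distant content is less important for the context node.
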
With a static assignment, users cannot retrieve dynamically, i.e., at query time, from arbitrary combinations of element types. The second reason why conventional retrieval techniques do not suffice for XML retrieval is that even a single XML document may have very heterogeneous content. Take for instance
