Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated the quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality. We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance-focused crawler and the quality-focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality-focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance-focused crawler. Analysis suggests that quality of content might be improved by post-filtering a very large breadth-first crawl, at the cost of substantially increased network traffic.
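A minimal sketch of the crawl-ordering idea described above, assuming a simple set-overlap relevance estimate for anchor context and a weighted-term quality score for the linking page; the function names, example weights, and frontier structure are illustrative assumptions, not the learned models used in the study.

```python
# Illustrative sketch only: the scoring functions and weights below are
# assumptions, not the learned models described in the abstract.
import heapq

def anchor_relevance(anchor_context, topic_terms):
    """Estimate link relevance as the overlap between anchor context and topic terms."""
    words = set(anchor_context.lower().split())
    return len(words & topic_terms) / max(len(topic_terms), 1)

def source_quality(page_text, weighted_terms):
    """IR-style quality score: sum of weights for terms present in the linking page."""
    text = page_text.lower()
    return sum(weight for term, weight in weighted_terms.items() if term in text)

def crawl_priority(anchor_context, page_text, topic_terms, weighted_terms):
    # Overall priority is the product of link relevance and source-page quality.
    return anchor_relevance(anchor_context, topic_terms) * source_quality(page_text, weighted_terms)

# Frontier ordered by descending priority (negated for Python's min-heap).
frontier = []
priority = crawl_priority("depression treatment options",
                          "Randomised trials provide evidence that ...",
                          {"depression", "treatment", "therapy"},
                          {"randomised": 0.9, "evidence": 0.7, "miracle": -1.2})
heapq.heappush(frontier, (-priority, "http://example.org/treatment"))
```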
Chapters 3–6 present alternative approaches to each of the major corpus modeling dimensions. These approaches vary in their ability to faithfully model a corpus; some of the approaches can be more or less faithful depending upon settings such as the number of segments in a piecewise linear model. There is a clear need to devise suitable evaluation methodologies for comparing different approaches and for measuring how faithfully an emulated collection matches the corresponding real one.
In this chapter, we assess the validity of synthetic test collections, constructed using methods we have described, in IR experimentation. To what extent do the timing, resource usage, and effectiveness results obtainable using synthetic data predict those we would get with real data? We also explore the trade-off between emulation fidelity and confidentiality.
Hyperlink recommendation evidence, that is, evidence based on the structure of the Web's link graph, is widely exploited by commercial Web search systems. However, there is little published work to support its popularity. Another form of query-independent evidence, URL-type, has been shown to be beneficial on a home page finding task. We compared the usefulness of these types of evidence on the home page finding task, combined with both content and anchor text baselines. Our experiments made use of five query sets spanning three corpora: one enterprise crawl, and the WT10g and VLC2 Web test collections. We found that, in optimal conditions, all of the query-independent methods studied (in-degree, URL-type, and two variants of PageRank) offered a better than random improvement on a content-only baseline. However, only URL-type offered a better than random improvement on an anchor text baseline. In realistic settings, for either baseline, only URL-type offered consistent gains. In combination with URL-type, the anchor text baseline was more useful for finding popular home pages, but URL-type with content was more useful for finding randomly selected home pages. We conclude that a general home page finding system should combine evidence from document content, anchor text, and URL-type classification.
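As a concrete illustration of the concluding recommendation, the sketch below combines content, anchor-text, and URL-type evidence with a simple linear mixture. The root/subroot/path/file classification, the priors, and the mixing weights are assumptions for illustration, not the combination used in the experiments.

```python
# A hedged sketch, not the paper's method: priors and weights are invented.
from urllib.parse import urlparse

# Illustrative priors: root URLs are far more likely to be home pages.
URL_TYPE_PRIOR = {"root": 1.0, "subroot": 0.5, "path": 0.1, "file": 0.05}

def url_type(url):
    """Classify a URL into root / subroot / path / file categories."""
    path = urlparse(url).path.strip("/")
    if not path:
        return "root"
    last = path.split("/")[-1]
    if "." in last:                      # ends in a filename such as index.html
        return "file"
    return "subroot" if "/" not in path else "path"

def combined_score(content_score, anchor_score, url,
                   w_content=0.3, w_anchor=0.5, w_url=0.2):
    """Linear mixture of content, anchor-text, and URL-type evidence."""
    return (w_content * content_score
            + w_anchor * anchor_score
            + w_url * URL_TYPE_PRIOR[url_type(url)])

print(combined_score(0.4, 0.6, "http://example.com/"))          # root URL gets a boost
print(combined_score(0.4, 0.6, "http://example.com/a/b.html"))  # deep file is discounted
```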
This year’s main experiment involved processing a mixed query stream, with an even mix of each query type studied in TREC-2003: 75 homepage finding queries, 75 named page finding queries and 75 topic distillation queries. The goal was to find ranking approaches that work well over the 225 queries, without access to query type labels. We also ran two small experiments. First, participants were invited to submit classification runs, attempting to correctly label the 225 queries by type. Second, we invited participants to download the new W3C test collection and to think about appropriate experiments for the proposed TREC-2005 Enterprise Track. This is the last year for the Web Track in its current form; it will not run in TREC-2005.
Concerted research effort since the nineteen fifties has led to effective methods for retrieval of relevant documents from homogeneous collections of text, such as newspaper archives, scientific abstracts and CD-ROM encyclopaedias. However, the triumph of the Web in the nineteen nineties forced a significant paradigm shift in the Information Retrieval field because of the need to address the issues of enormous scale, fluid collection definition, great heterogeneity, unfettered interlinking, democratic publishing, the presence of adversaries and, most of all, the diversity of purposes for which Web search may be used. Now, the IR field is confronted with a challenge of similarly daunting dimensions: how to bring highly effective search to the complex information spaces within enterprises. Overcoming the challenge would bring massive economic benefit, but victory is far from assured. The present work characterises enterprise search, hints at its economic magnitude, states some of the unsolved research questions in the domain of enterprise search, proposes an enterprise search test collection and presents results for a small but interesting sub-problem.
Web search engines crawl the Web to fetch the data that they index. In this paper we re-examine that need, evaluating both the network costs associated with data acquisition and alternative ways in which a search service might be supported. As a concrete example, we make use of the Research Finder search service provided at http://rf.panopticsearch.com, and information derived from its crawl and query logs. Based upon an analysis of the Research Finder system, we introduce a hybrid arrangement in which queries are evaluated partially by reference to a centrally maintained index representing a subset of the collection, and partially by referring them on to the local search services maintained by the balance of the collection. We also examine various ways in which crawling costs can be reduced.
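A minimal sketch of the hybrid arrangement just described: a query is answered partly from a central index covering a subset of the collection and partly by forwarding it to the local search services of the remaining sites. The interfaces and the score-merging step are assumptions for illustration; they are not components of the Research Finder system, and in practice scores from heterogeneous services would need normalisation.

```python
# Illustrative sketch only: StubService and the merging rule are assumed,
# not part of the system analysed in the paper.
from collections import namedtuple

Result = namedtuple("Result", ["url", "score"])

class StubService:
    """Stand-in for either the central index or a site's local search service."""
    def __init__(self, results):
        self._results = results
    def search(self, query, k):
        return self._results[:k]

def hybrid_search(query, central_index, local_services, k=10):
    # Part of the collection is answered from the centrally maintained index ...
    results = list(central_index.search(query, k))
    # ... and the query is referred on to the local services for the rest.
    for service in local_services:
        results.extend(service.search(query, k))
    # Naive merge by score; real systems must normalise scores across services.
    return sorted(results, key=lambda r: r.score, reverse=True)[:k]

central = StubService([Result("http://rf.example/a", 0.9)])
locals_ = [StubService([Result("http://site1.example/b", 0.7)])]
print(hybrid_search("information retrieval", central, locals_, k=5))
```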
The presentation of query biased document snippets as part of results pages presented by search engines has become an expectation of search engine users. In this paper we explore the algorithms and data structures required as part of a search engine to allow efficient generation of query biased snippets. We begin by proposing and analysing a document compression method that reduces snippet generation time by 58% over a baseline using the zlib compression library. These experiments reveal that finding documents on secondary storage dominates the total cost of generating snippets, and so caching documents in RAM is essential for a fast snippet generation process. Using simulation, we examine snippet generation performance for different size RAM caches. Finally, we propose and analyse document reordering and compaction, revealing a scheme that increases the number of document cache hits with only a marginal effect on snippet quality. This scheme effectively doubles the number of documents that can fit in a fixed size cache.
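The sketch below illustrates two of the ingredients discussed above, zlib-compressed document storage and an in-memory cache of decompressed documents, together with a very simple query-biased sentence scorer. The cache size, scoring rule, and storage layout are illustrative assumptions rather than the schemes evaluated in the paper.

```python
# Illustrative sketch only: the compression, caching, and scoring details
# here are assumptions, not the methods evaluated in the paper.
import zlib
from functools import lru_cache

DOC_STORE = {}  # doc_id -> zlib-compressed UTF-8 text (stand-in for secondary storage)

def store_document(doc_id, text):
    DOC_STORE[doc_id] = zlib.compress(text.encode("utf-8"))

@lru_cache(maxsize=4096)          # RAM cache of decompressed documents
def fetch_document(doc_id):
    return zlib.decompress(DOC_STORE[doc_id]).decode("utf-8")

def query_biased_snippet(doc_id, query):
    """Return the sentence containing the most distinct query terms."""
    terms = set(query.lower().split())
    sentences = fetch_document(doc_id).split(". ")
    return max(sentences, key=lambda s: len(terms & set(s.lower().split())))

store_document(1, "Snippets summarise documents. Query biased snippets highlight query terms. "
                  "Caching makes generation fast.")
print(query_biased_snippet(1, "query biased snippets"))
```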
The SynthaCorpus methods for generating words (Chapters 4 and 5) are concerned with generating a series of integers Ri representing the ranks of the words in a Zipf-style ordering. If we are given a lexicon (for example the lexicon of a corpus being emulated), we can convert each Ri to a string by simple look-up. If we have no lexicon and no interest in the actual textual representations, we can emit strings such as "t27", representing the 27th most frequently occurring word.
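A minimal sketch of this rank-to-string step, assuming the generator of rank integers exists elsewhere; the toy lexicon below is an assumption, ordered by descending frequency as the text describes.

```python
# Illustrative sketch of mapping rank integers Ri to word strings.
def rank_to_word(rank, lexicon=None):
    if lexicon is not None:
        return lexicon[rank - 1]      # lexicon ordered by descending frequency
    return "t%d" % rank               # no lexicon: emit a synthetic token such as "t27"

# With a lexicon taken from the corpus being emulated (toy example):
lexicon = ["the", "of", "and", "to"]
print([rank_to_word(r, lexicon) for r in [1, 4, 2]])   # ['the', 'to', 'of']

# Without a lexicon, ranks are emitted directly as synthetic tokens:
print([rank_to_word(r) for r in [27, 1]])              # ['t27', 't1']
```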