    Entropy-based clustering for improving document re-ranking
    Citations (1)
    References (15)
    Related Papers (10)
    Abstract:
    Document re-ranking sits between initial retrieval and query expansion in an information retrieval system. In this paper, we propose an entropy-based clustering approach for document re-ranking. The value of within-cluster entropy determines whether two classes should be merged, and the value of between-cluster entropy determines how many clusters are reasonable. The next step is to select a suitable cluster from the clustering result, use it to construct a pseudo labeled document, and conduct document re-ranking as in our previous method. We focus on the clustering strategy for documents after initial retrieval. Experiments with NTCIR-5 data show that the approach can improve on the performance of initial retrieval and helps improve the quality of document re-ranking.
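    The abstract does not give the entropy formulas, so the following is only a minimal sketch of what an entropy-driven merge test could look like. It assumes that a cluster is represented by its pooled term counts, that within-cluster entropy is the Shannon entropy of that term distribution, and that two clusters are merged when pooling them raises entropy by no more than a threshold. The names `entropy`, `merge_cost`, `should_merge` and the threshold value are hypothetical and not taken from the paper.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a term-count distribution."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def merge_cost(a, b):
    """Entropy increase caused by pooling clusters a and b.

    Assumed criterion: entropy of the merged cluster minus the
    size-weighted mean entropy of the two separate clusters.
    """
    na, nb = sum(a.values()), sum(b.values())
    merged = a + b  # Counter addition pools the term counts
    separate = (na * entropy(a) + nb * entropy(b)) / (na + nb)
    return entropy(merged) - separate


def should_merge(a, b, threshold=0.5):
    """Merge when pooling adds little entropy (threshold is an assumption)."""
    return merge_cost(a, b) <= threshold


# Two clusters with the same vocabulary versus one with disjoint vocabulary.
c1 = Counter({"entropy": 2, "cluster": 2})
c2 = Counter({"entropy": 2, "cluster": 2})
c3 = Counter({"football": 2, "match": 2})
print(should_merge(c1, c2), should_merge(c1, c3))
```

    Under this assumed criterion, clusters that share a term distribution merge at no cost, while clusters with disjoint vocabularies pay the full entropy increase and stay apart.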
    Keywords:
    Document Clustering
    Document clustering is an important tool, but it is not yet widely used in practice probably because of its high computational complexity. This article explores techniques of high‐speed rough clustering of documents, assuming that it is sometimes necessary to obtain a clustering result in a shorter time, although the result is just an approximate outline of document clusters. A promising approach for such clustering is to reduce the number of documents to be checked for generating cluster vectors in the leader–follower clustering algorithm. Based on this idea, the present article proposes a modified Crouch algorithm and incomplete single‐pass leader–follower algorithm. Also, a two‐stage grouping technique, in which the first stage attempts to decrease the number of documents to be processed in the second stage by applying a quick merging technique, is developed. An experiment using a part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single‐pass leader–follower algorithms achieve clustering results more efficiently than the original methods, and also improved the effectiveness of clustering results. On the other hand, the two‐stage grouping technique did not reduce the processing time in this experiment.
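    For orientation, here is a minimal sketch of the basic single-pass leader-follower scheme that the abstract builds on. It is not the modified Crouch algorithm or the incomplete variant proposed in the article. It assumes documents are sparse term-weight dictionaries, that similarity is cosine, and that a cluster vector is the sum of its members; the function names and the threshold value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def leader_follower(docs, threshold=0.5):
    """Assign each document to a cluster in a single pass.

    A document joins the most similar existing cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # one summed term-weight vector per cluster
    labels = []
    for doc in docs:
        best, best_sim = -1, 0.0
        for i, vec in enumerate(clusters):
            sim = cosine(doc, vec)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Follower: fold the document into the existing cluster vector.
            for t, w in doc.items():
                clusters[best][t] = clusters[best].get(t, 0.0) + w
            labels.append(best)
        else:
            # Leader: the document becomes the seed of a new cluster.
            clusters.append(dict(doc))
            labels.append(len(clusters) - 1)
    return labels, clusters


docs = [
    {"oil": 1, "price": 1},
    {"oil": 1, "price": 2},
    {"goal": 1, "match": 1},
    {"goal": 2},
]
labels, clusters = leader_follower(docs)
print(labels)
```

    Every document is compared against every existing cluster vector, which is the cost the article aims to reduce by checking fewer documents when generating cluster vectors.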
    Document Clustering
    Citations (9)
    Text clustering typically involves clustering in a high-dimensional space, which is difficult in virtually all practical settings. In addition, given a particular clustering result, it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. To solve these problems, this paper proposes a method for Chinese document clustering based on topic concept clustering. We introduce a novel topical document clustering method called Document Features Indexing Clustering (DFIC), which can identify topics accurately and cluster documents according to these topics. In DFIC, “topic elements” are defined and extracted for indexing base clusters. Additionally, document features are investigated and exploited. Experimental results show that DFIC can achieve higher precision (92.76%) than some widely used traditional clustering methods.
    Document Clustering
    Clustering high-dimensional data
    Brown clustering
    Single-linkage clustering
    Conceptual clustering
    Ranking based on passages addresses some of the shortcomings of whole-document ranking. It provides convenient units of text to return to the user, avoids the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material amongst otherwise irrelevant text. In this paper we explore the potential of passage retrieval, based on an experimental evaluation of the ability of passages to identify relevant documents. We compare our scheme of arbitrary passage retrieval to several other document retrieval and passage retrieval methods; we show experimentally that, compared to these methods, ranking via fixed-length passages is robust and effective. Our experiments also show that, compared to whole-document ranking, ranking via fixed-length arbitrary passages significantly improves retrieval effectiveness, by 8% for TREC disks 2 and 4 and by 18%-37% for the Federal Register collection.
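    The idea of scoring a document by its best fixed-length passage can be sketched as follows. This is an illustration under simplifying assumptions, not the paper's method: passages are word windows that start at every `stride` words, and the passage score is a plain count of query-term occurrences standing in for a real similarity measure. All function names and parameter values are hypothetical.

```python
def passages(words, length=50, stride=25):
    """Yield fixed-length word windows; the final window may be shorter."""
    if len(words) <= length:
        yield words
        return
    for start in range(0, len(words) - length + stride, stride):
        yield words[start:start + length]


def passage_score(passage, query_terms):
    """Stand-in score: number of query-term occurrences in the passage."""
    return sum(1 for w in passage if w in query_terms)


def rank_by_best_passage(docs, query, length=50, stride=25):
    """Rank document ids by the score of their highest-scoring passage."""
    q = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        best = max(passage_score(p, q) for p in passages(words, length, stride))
        scored.append((best, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored]


filler = ["filler"] * 5
docs = {
    # Query terms concentrated in one short block of a longer document.
    "a": " ".join(["filler"] * 20 + ["entropy", "clustering"]),
    # More matches overall, but spread thinly across the text.
    "b": " ".join(["entropy"] + filler + ["clustering"] + filler
                  + ["entropy"] + filler + ["clustering"]),
}
ranking = rank_by_best_passage(docs, "entropy clustering", length=5, stride=2)
print(ranking)
```

    In this toy data, document "b" contains more query terms overall, but "a" holds them in a single short block, which is the kind of material that passage-level scoring is meant to surface.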
    Identification
    Learning to Rank
    Citations (258)
    Text clustering is a data mining technique that is becoming more important in present studies. Document clustering makes use of text clustering to divide documents according to their topics. The choice of words in document clustering is important to ensure that each document can be classified correctly. Three clustering methods, namely hierarchical clustering, k-means and k-medoids, are used and compared in this study in order to identify which produces the best result in document clustering. The three methods are applied to 60 sports articles covering four different types of sports. K-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive to general words. Therefore, hierarchical clustering is deemed the more stable method for producing a meaningful result in document clustering analysis.
    Document Clustering
    Brown clustering
    Single-linkage clustering
    Hierarchical clustering
    Clustering high-dimensional data
    Complete-linkage clustering
    Conceptual clustering
    Consensus clustering
    Clustering of the contents of a document corpus is used to create sub-corpora that are expected to consist of documents related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data, it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but, as the results show, our measurement techniques can effectively distinguish between clustering methods.
    Document Clustering
    Relevance
    Brown clustering
    Clustering high-dimensional data
    Conceptual clustering
    Document Clustering
    Information bottleneck method
    Data stream clustering
    Local optimum
    Single-linkage clustering
    Citations (2)
    In this paper, a relevant document retrieval method is proposed for document retrieval systems with vector space models (VSM). In recent years, as databases have become extremely large, demand has grown for accurate and fast document retrieval algorithms. Based on the maximum similarity criterion, a document retrieval algorithm using the discrete stochastic optimization method is proposed to retrieve the documents relevant to a user query. The proposed algorithm is self-learning in that most of the computational effort is spent at the global optimal document, and it converges quickly to the relevant documents in the database. Numerical results demonstrate that the proposed algorithm has good convergence properties and satisfactory document retrieval performance.
    Document Clustering
    Vector space model
    Similarity (geometry)
    Document re-ranking is an intermediate module in an information retrieval system. Its goal is to place documents more relevant to the query at higher ranks, which benefits automatic query expansion and improves the performance of the retrieval system as a whole. In this paper, we construct a pseudo labeled document based on the pseudo-relevance feedback principle, and discuss how the performance of document re-ranking relates to the number of top documents taken from initial retrieval and to the number of key terms drawn from those documents when constructing the pseudo labeled document. Experiments show that our pseudo labeled document approach is greatly helpful to document re-ranking; this is the main contribution of the paper. Moreover, experiments show that re-ranking performance decreases as the number of top documents increases, and increases as the number of key terms from these documents increases.
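    The two parameters discussed in this abstract, the number of top documents and the number of key terms, can be made concrete with a small sketch. The abstract does not say how key terms are selected or how documents are compared with the pseudo labeled document, so this example assumes key terms are the most frequent terms in the top documents and that re-ranking uses cosine similarity over raw term counts. The names `build_pseudo_document`, `rerank`, `top_docs` and `key_terms` are hypothetical.

```python
import math
from collections import Counter


def build_pseudo_document(ranked_texts, top_docs=2, key_terms=3):
    """Pool the top-ranked documents and keep their most frequent terms.

    Assumption: term frequency is the key-term criterion; stop-word
    removal and term weighting are left out for brevity.
    """
    pooled = Counter()
    for text in ranked_texts[:top_docs]:
        pooled.update(text.lower().split())
    return dict(pooled.most_common(key_terms))


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank(ranked_texts, top_docs=2, key_terms=3):
    """Re-order the initial ranking by similarity to the pseudo document."""
    pseudo = build_pseudo_document(ranked_texts, top_docs, key_terms)
    scored = [(cosine(Counter(text.lower().split()), pseudo), i)
              for i, text in enumerate(ranked_texts)]
    # Most similar first; ties keep their initial-retrieval order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [ranked_texts[i] for _, i in scored]


initial = [
    "entropy clustering entropy",
    "clustering documents entropy",
    "football match report",
    "entropy clustering of documents",
]
reranked = rerank(initial)
print(reranked)
```

    With these toy inputs, the off-topic third document drops below the fourth, which shares vocabulary with the top-ranked documents.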
    Relevance
    Relevance Feedback
    Citations (4)