Entropy-based clustering for improving document re-ranking

Chong Teng Yanxiang He Donghong Ji Cheng Zhou Yixuan Geng Shu Chen

Abstract:

Document re-ranking sits between initial retrieval and query expansion in an information retrieval system. In this paper, we propose an entropy-based clustering approach for document re-ranking. The value of within-cluster entropy determines whether two classes should be merged, and the value of between-cluster entropy determines how many clusters are reasonable. The next step is to find a suitable cluster in the clustering result, use it to construct a pseudo labeled document, and conduct document re-ranking as in our previous method. We focus on the clustering strategy for documents after initial retrieval. Experiments with NTCIR-5 data show that the approach can improve the performance of initial retrieval and that it helps improve the quality of document re-ranking.
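The abstract does not give the entropy formulas, so the following is only a minimal sketch of how an entropy-driven merge decision could look. It assumes that a cluster is summarized by its pooled term counts, that its entropy is the Shannon entropy of that term distribution, and that two clusters are merged when pooling them raises entropy by less than a threshold. The names `term_entropy`, `merge_cost`, `should_merge`, and the `threshold` value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def term_entropy(counts):
    """Shannon entropy (in bits) of a term-frequency distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values() if c > 0
    )


def merge_cost(a, b):
    """Entropy increase caused by pooling two clusters.

    Assumption: cost = H(pooled) - size-weighted mean of H(a) and H(b).
    This is an illustrative criterion, not the paper's definition.
    """
    na, nb = sum(a.values()), sum(b.values())
    if na + nb == 0:
        return 0.0
    weighted = (na * term_entropy(a) + nb * term_entropy(b)) / (na + nb)
    return term_entropy(a + b) - weighted


def should_merge(a, b, threshold=0.3):
    """Merge when pooling adds little entropy (threshold is hypothetical)."""
    return merge_cost(a, b) < threshold


# Two clusters with the same term mix merge; disjoint vocabularies do not.
same = should_merge(Counter("ab"), Counter("aabb"))
different = should_merge(Counter("aa"), Counter("bb"))
```

Under these assumptions the cost is zero when both clusters share the same term distribution and grows as their vocabularies diverge, which is one plausible reading of "within-cluster entropy determines whether two classes should be merged".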

Keywords:

Document Clustering

Topics:

Data Management and Algorithms

Image Retrieval and Classification Techniques

Advanced Text Analysis Techniques

10.1109/icicisys.2009.5358089

High‐speed rough clustering for very large document collections

Journal of the American Society for Information Science and Technology (2010)

Kazuaki Kishida

Document clustering is an important tool, but it is not yet widely used in practice probably because of its high computational complexity. This article explores techniques of high‐speed rough clustering of documents, assuming that it is sometimes necessary to obtain a clustering result in a shorter time, although the result is just an approximate outline of document clusters. A promising approach for such clustering is to reduce the number of documents to be checked for generating cluster vectors in the leader–follower clustering algorithm. Based on this idea, the present article proposes a modified Crouch algorithm and incomplete single‐pass leader–follower algorithm. Also, a two‐stage grouping technique, in which the first stage attempts to decrease the number of documents to be processed in the second stage by applying a quick merging technique, is developed. An experiment using a part of the Reuters corpus RCV1 showed empirically that both the modified Crouch and the incomplete single‐pass leader–follower algorithms achieve clustering results more efficiently than the original methods, and also improved the effectiveness of clustering results. On the other hand, the two‐stage grouping technique did not reduce the processing time in this experiment.
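For orientation, the baseline single-pass leader–follower scheme that the article modifies can be sketched as follows. This is the generic textbook form, not the modified Crouch or incomplete single-pass variants proposed in the article. Documents are assumed to be sparse term-weight dictionaries, and the `threshold` value and function names are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def leader_follower(docs, threshold=0.5):
    """Assign each document, in one pass, to the most similar cluster vector.

    A document that is not similar enough to any existing cluster vector
    starts a new cluster (becomes a new leader).
    """
    leaders = []  # one summed term vector per cluster
    labels = []
    for doc in docs:
        best, best_sim = -1, 0.0
        for i, lead in enumerate(leaders):
            sim = cosine(doc, lead)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim >= threshold:
            # Follower: fold the document into the cluster vector.
            for t, w in doc.items():
                leaders[best][t] = leaders[best].get(t, 0.0) + w
            labels.append(best)
        else:
            # Leader: open a new cluster seeded with this document.
            leaders.append(dict(doc))
            labels.append(len(leaders) - 1)
    return labels


labels = leader_follower([{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"x": 1}])
```

Every document is compared against every cluster vector here; the article's speed-ups come from reducing the number of documents checked when generating those vectors, which this sketch does not attempt.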

Document Clustering

10.1002/asi.21311

Citations (9)

Topical Concept Based Text Clustering Method

Advanced materials research (2012)

Yi Ding Xian Fu

Text clustering typically involves clustering in a high dimensional space, which appears difficult with regard to virtually all practical settings. In addition, given a particular clustering result it is typically very hard to come up with a good explanation of why the text clusters have been constructed the way they are. To solve these problems, this paper proposes a method for Chinese document clustering based on topic concept clustering. We introduce a novel topical document clustering method called Document Features Indexing Clustering (DFIC), which can identify topics accurately and cluster documents according to these topics. In DFIC, “topic elements” are defined and extracted for indexing base clusters. Additionally, document features are investigated and exploited. Experimental results show that DFIC can achieve a higher precision (92.76%) than some widely used traditional clustering methods.

Document Clustering

Clustering high-dimensional data

Brown clustering

Single-linkage clustering

Conceptual clustering

10.4028/www.scientific.net/amr.532-533.939

Citations (3)

Passage retrieval revisited

Proceedings of the 20th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '97 (1997)

Marcin Kaszkiel Justin Zobel

Ranking based on passages addresses some of the shortcomings of whole-document ranking. It provides convenient units of text to return to the user, avoids the difficulties of comparing documents of different length, and enables identification of short blocks of relevant material amongst otherwise irrelevant text. In this paper we explore the potential of passage retrieval, based on an experimental evaluation of the ability of passages to identify relevant documents. We compare our scheme of arbitrary passage retrieval to several other document retrieval and passage retrieval methods; we show experimentally that, compared to these methods, ranking via fixed-length passages is robust and effective. Our experiments also show that, compared to whole-document ranking, ranking via fixed-length arbitrary passages significantly improves retrieval effectiveness, by 8% for TREC disks 2 and 4 and by 18%-37% for the Federal Register collection.
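The idea of ranking documents by their best fixed-length passage can be sketched as below. This is a simplified illustration, not the paper's scheme: a passage may start at any token, every passage has the same length, and the passage score is assumed to be a plain count of query-term occurrences, whereas the paper uses a proper similarity measure. The function names and the default `length` are hypothetical.

```python
def best_passage_score(doc_tokens, query_terms, length=150):
    """Highest query-term count over all windows of `length` tokens."""
    query = set(query_terms)
    hits = [1 if tok in query else 0 for tok in doc_tokens]
    if not hits:
        return 0
    window = sum(hits[:length])  # first passage (whole doc if it is short)
    best = window
    # Slide the window one token at a time, updating the count incrementally.
    for start in range(1, max(1, len(hits) - length + 1)):
        window += hits[start + length - 1] - hits[start - 1]
        best = max(best, window)
    return best


def rank_by_passage(docs, query_terms, length=150):
    """Return document ids ordered by their best passage score, best first."""
    return sorted(
        docs,
        key=lambda d: best_passage_score(docs[d], query_terms, length),
        reverse=True,
    )


docs = {
    "d1": "a b c q q d e".split(),
    "d2": "q a b c d e q".split(),
}
ranking = rank_by_passage(docs, ["q"], length=2)
```

Both example documents contain the query term twice, but only `d1` has the two occurrences close together, so it ranks first. Because every passage has the same length, the score needs no document-length normalization.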

Identification

Learning to Rank

10.1145/258525.258561

Citations (258)

Recent trends in hierarchic document clustering: A critical review

Information Processing & Management (1988)

Peter Willett

Document Clustering

Hierarchical clustering

Similarity (geometry)

Complete linkage

10.1016/0306-4573(88)90027-1

Citations (725)

Seed-Guided Deep Document Clustering

Lecture notes in computer science (2020)

Mazar Moradi Fard Thibaut Thonet Éric Gaussier

Document Clustering

Brown clustering

Clustering high-dimensional data

10.1007/978-3-030-45439-5_1

Citations (3)

Comparative Study of Document Clustering Algorithms

International Journal of Engineering & Technology (2018)

Noratiqah Mohd Ariff Mohd Aftar Abu Bakar M. I. Rahmad

Text clustering is a data mining technique that is becoming more important in present studies. Document clustering makes use of text clustering to divide documents according to their various topics. The choice of words in document clustering is important to ensure that the document can be classified correctly. Three clustering methods, namely hierarchical clustering, k-means and k-medoids, are used and compared in this study in order to identify the method that produces the best result in document clustering. The three methods are applied to 60 sports articles involving four different types of sports. The k-medoids clustering produced the worst result, while k-means clustering was found to be more sensitive towards general words. Therefore, hierarchical clustering is deemed more stable in producing a meaningful result in document clustering analysis.

Document Clustering

Brown clustering

Single-linkage clustering

Hierarchical clustering

Clustering high-dimensional data

Complete-linkage clustering

Conceptual clustering

Consensus clustering

10.14419/ijet.v7i4.11.20816

Citations (6)

Measurement of clustering effectiveness for document collections

Information Retrieval (2022)

Meng Yuan Justin Zobel Pauline Chou

Clustering of the contents of a document corpus is used to create sub-corpora with the intention that they are expected to consist of documents that are related to each other. However, while clustering is used in a variety of ways in document applications such as information retrieval, and a range of methods have been applied to the task, there has been relatively little exploration of how well it works in practice. Indeed, given the high dimensionality of the data it is possible that clustering may not always produce meaningful outcomes. In this paper we use a well-known clustering method to explore a variety of techniques, existing and novel, to measure clustering effectiveness. Results with our new, extrinsic techniques based on relevance judgements or retrieved documents demonstrate that retrieval-based information can be used to assess the quality of clustering, and also show that clustering can succeed to some extent at gathering together similar material. Further, they show that intrinsic clustering techniques that have been shown to be informative in other domains do not work for information retrieval. Whether clustering is sufficiently effective to have a significant impact on practical retrieval is unclear, but as the results show our measurement techniques can effectively distinguish between clustering methods.

Document Clustering

Relevance

Brown clustering

Clustering high-dimensional data

Conceptual clustering

10.1007/s10791-021-09401-8

Citations (8)

An Improved Sequential IB Algorithm for Document Clustering

Pattern Recognition and Artificial Intelligence (2008)

Ye Yang

Document Clustering

Information bottleneck method

Data stream clustering

Local optimum

Single-linkage clustering

Citations (2)

Relevant document retrieval via discrete stochastic optimization

2013 10th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP) (2013)

Shuhuai Ren

In this paper, a relevant document retrieval method is proposed for document retrieval systems with vector space models (VSM). In recent years, as databases have become extremely large, there is high demand for an accurate and fast document retrieval algorithm. Based on the maximum similarity criterion, a document retrieval algorithm using the discrete stochastic optimization method is proposed to retrieve the documents relevant to the user query. The proposed algorithm has a self-learning capability, in that most of the computational effort is spent at the global optimal document, and it converges fast to the relevant documents in the database. Numerical results demonstrate that the proposed algorithm has good convergence properties and satisfactory document retrieval performance.

Document Clustering

Vector space model

Similarity (geometry)

10.1109/iccwamtip.2013.6716603

Citations (1)

A Study on Pseudo Labeled Document Constructed for Document Re-ranking

Chong Teng Yanxiang He Donghong Ji Guimin Lin Zhewei Mai

Document re-ranking is an intermediate module in an information retrieval system. It is expected to move the documents most relevant to the query to higher ranks, from which automatic query expansion can benefit, and it aims at improving the performance of the entire information retrieval process. In this paper, we construct a pseudo labeled document based on the pseudo-relevance feedback principle and discuss how the performance of document re-ranking relates to the number of top documents taken from the initial retrieval and the number of key terms drawn from those documents when constructing the pseudo labeled document. Experiments show that our approach of constructing a pseudo labeled document is greatly helpful to document re-ranking, which is the main contribution of the paper. Moreover, experiments show that the performance of document re-ranking decreases as the number of top documents increases, and increases as the number of key terms from these documents increases.
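The two parameters discussed in the abstract, the number of top documents and the number of key terms, can be made concrete with a small sketch. The abstract does not say how key terms are selected or how documents are compared with the pseudo labeled document, so this sketch assumes raw term frequency for term selection and cosine similarity for re-ranking. All names and defaults (`pseudo_labeled_doc`, `rerank`, `top_n`, `num_terms`) are hypothetical.

```python
import math
from collections import Counter


def pseudo_labeled_doc(ranked_docs, top_n=10, num_terms=20):
    """Pool the top-ranked documents and keep their most frequent terms.

    Assumption: key terms are chosen by raw frequency in the pooled text.
    """
    pooled = Counter()
    for tokens in ranked_docs[:top_n]:
        pooled.update(tokens)
    return dict(pooled.most_common(num_terms))


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank(ranked_docs, top_n=10, num_terms=20):
    """Return document indices reordered by similarity to the pseudo doc.

    Ties keep the order of the initial retrieval.
    """
    pseudo = pseudo_labeled_doc(ranked_docs, top_n, num_terms)
    scored = [
        (cosine(Counter(tokens), pseudo), i)
        for i, tokens in enumerate(ranked_docs)
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored]


initial = [["cat", "dog"], ["fish"], ["cat", "cat", "dog"]]
order = rerank(initial, top_n=1, num_terms=2)
```

In this toy run the third document shares vocabulary with the top-ranked one and so moves ahead of the second. Raising `top_n` pools more, possibly less relevant, documents into the pseudo labeled document, which matches the trade-off the abstract reports.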

Relevance

Relevance Feedback

10.1109/aici.2009.311

Citations (4)