Clustering - What Both Theoreticians and Practitioners are Doing Wrong

arXiv (Cornell University) (2018)

Abstract:

Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, are very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computational tasks, for clustering, different algorithms often yield drastically different outcomes. Therefore, the choice of a clustering algorithm and its parameters (like the number of clusters) may play a crucial role in the usefulness of an output clustering solution. However, there currently exists no methodical guidance for clustering tool selection for a given clustering task. Practitioners pick the algorithms they use without awareness of the implications of their choices, and the vast majority of theory-of-clustering papers focus on providing savings in the resources needed to solve optimization problems that arise from picking some concrete clustering objective. Such savings pale in comparison to the costs of mismatch between those objectives and the intended use of clustering results. I argue the severity of this problem and describe some recent proposals aiming to address this crucial lacuna.
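The claim that different algorithms can yield drastically different outcomes on the same data can be illustrated with a small sketch. The example below is not taken from the paper: the toy data set (two parallel strips of points), the function names, and the fixed k-means initialization are all assumptions chosen for illustration. It compares a plain Lloyd-style k-means with single-linkage clustering, both asked for two clusters.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def single_linkage(points, k):
    """Single-linkage clustering: merge closest pairs until k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process point pairs from closest to farthest.
    edges = sorted(combinations(range(len(points)), 2),
                   key=lambda e: dist(points[e[0]], points[e[1]]))
    components = len(points)
    for i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # Relabel roots as 0, 1, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]


def kmeans(points, centers, iters=100):
    """Lloyd's iterations from the given (hypothetical) initial centers."""
    centers = list(centers)
    labels = None
    for _ in range(iters):
        new_labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels


def kmeans_cost(points, labels):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(v) / len(members) for v in zip(*members))
        total += sum((a - b) ** 2 for p in members for a, b in zip(p, centroid))
    return total


# Toy data (an assumption): two horizontal strips of 10 points, 3 units apart.
points = [(float(x), 0.0) for x in range(10)] + [(float(x), 3.0) for x in range(10)]

sl_labels = single_linkage(points, k=2)
km_labels = kmeans(points, centers=[points[0], points[9]])

print("single linkage:", sl_labels)
print("k-means:       ", km_labels)
print("k-means objective of each partition:",
      kmeans_cost(points, km_labels), kmeans_cost(points, sl_labels))
```

On this toy data, single linkage returns the two strips, while k-means cuts both strips into a left half and a right half. The k-means objective is lower for the left/right split (85 versus 165), so the disagreement comes from the objectives themselves rather than from the initialization. Which partition is "right" depends on the intended use, which is the model-selection problem the abstract describes.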

Keywords:

Conceptual clustering

Consensus clustering

Constrained clustering

Implementation

Topics:

Advanced Clustering Algorithms Research

10.48550/arxiv.1805.08838

Clustering Techniques: A Brief Survey of Different Clustering Algorithms

Deepti Sisodia, Lokesh Singh, Sheetal Sisodia

Partitioning a set of objects into homogeneous clusters is a fundamental operation in data mining. The operation is needed in a number of data mining tasks. Clustering, or data grouping, is a key technique in data mining. It is an unsupervised learning task where one seeks to identify a finite set of categories, termed clusters, to describe the data. The grouping of data into clusters is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how does one decide what constitutes a good clustering? This paper deals with the study of various clustering algorithms in data mining, and it focuses on the basics, requirements, classification, problems, and application areas of clustering algorithms.

Constrained clustering

Data stream clustering

Single-linkage clustering

Consensus clustering

Clustering high-dimensional data

Similarity (geometry)

Data set

Citations (68)

Performance evaluation of hierarchical clustering algorithms

2010 International Conference on Communication and Computational Intelligence (INCOCCI) (2010)

E. Gothai, P. Balasubramanie

Clustering, an unsupervised learning process, is a challenging problem. The quality of the clustering result improves the overall structure. In this article, we propose an incremental stream of hierarchical clustering that improves the efficiency and accuracy, and reduces the time consumption, of a text categorization algorithm by forming an exact sub-clustering. We propose a new method called multilevel clustering, which is a combination of supervised and unsupervised techniques to form the clustering. In this method we form four levels of clustering. The proposed work uses the existing clustering algorithm. We develop and discuss algorithms for the multilevel clustering method to achieve the best clustering experiment.

Data stream clustering

Single-linkage clustering

Conceptual clustering

Hierarchical clustering

Brown clustering

Constrained clustering

Consensus clustering

Clustering high-dimensional data

Citations (3)

Review of Traditional and Ensemble Clustering Algorithms for High Dimensional Data

SSRN Electronic Journal (2018)

K. Kalaiselvi, D. Karthika

High-dimensional data, which is described by a large number of features, introduces new issues for clustering. The term 'high dimensionality' was coined initially to describe the general increase in time complexity of several computational problems, and under it the general clustering algorithms perform poorly. Accordingly, several works have focused on introducing new techniques and clustering algorithms for handling higher-dimensional data. Common to all clustering algorithms is the fact that they need some fundamental measure of similarity among data objects. However, the existing clustering algorithms still have some open research issues. In this review, we provide a summary of the effects of high-dimensional data spaces and their implications for various clustering algorithms. We also present a detailed overview of many clustering algorithms of several types (subspace methods, model-based clustering, density-based clustering methods, partition-based clustering methods, etc.), including a more detailed description of recent work and its advantages and disadvantages for solving the higher-dimensionality data problem. The scope of future work to extend the present clustering methods and algorithms is also discussed at the end of the work.

Data stream clustering

Clustering high-dimensional data

Constrained clustering

Consensus clustering

10.2139/ssrn.3170321

Citations (0)

Unsupervised learning via iteratively constructed clustering ensemble

2010 International Joint Conference on Neural Networks (IJCNN) (2010)

Yun Yang, Ke Chen

Unsupervised classification, or clustering, is an important data analysis technique demanded in various fields including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Recently a large number of studies have attempted to improve clustering by combining multiple clustering solutions into a single consolidated clustering ensemble that has the best performance among given clustering solutions. However, different clustering ensembles behave differently on data of various characteristics. In this paper, we propose a novel approach to data clustering by constructing a clustering ensemble iteratively based on partitions generated on training subsets sampled from the original dataset. To yield a robust clustering ensemble, our approach employs a hybrid sampling scheme inspired by both boosting and bagging techniques originally proposed for supervised learning. Our approach has been evaluated on synthetic data and real-world motion trajectory data sets, and experimental results demonstrate that it yields satisfactory performance for a variety of clustering tasks.

Consensus clustering

Ensemble Learning

Data stream clustering

Conceptual clustering

Boosting

Single-linkage clustering

Clustering high-dimensional data

10.1109/ijcnn.2010.5596577

Citations (6)

Clustering: science or art?

Ulrike von Luxburg, Robert C. Williamson, Isabelle Guyon

We examine whether the quality of different clustering algorithms can be compared by a general, scientifically sound procedure which is independent of particular clustering algorithms. We argue that the major obstacle is the difficulty in evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We argue that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use. Different techniques to evaluate clustering algorithms have to be developed for different uses of clustering. To simplify this procedure we argue that it will be useful to build a taxonomy of clustering problems to identify clustering applications which can be treated in a unified way and that such an effort will be more fruitful than attempting the impossible--developing optimal domain-independent clustering algorithms or even classifying clustering algorithms in terms of how they work.

Constrained clustering

Data stream clustering

Conceptual clustering

Brown clustering

Clustering high-dimensional data

Consensus clustering

Citations (69)

A Post Web Document Clustering Algorithm

KIPS Transactions on Computer and Communication Systems (2002)

Young Hee Im

Post-clustering algorithms, which cluster the results of Web search engines, have several requirements that differ from those of conventional clustering algorithms. In this paper, we propose a new post-clustering algorithm that satisfies as many of those requirements as possible. The proposed Concept ART combines the concept vector, which has several advantages in document clustering, with Fuzzy ART, which is known as a real-time clustering algorithm. Moreover, we show that it is applicable to general-purpose clustering as well as post-clustering.

Data stream clustering

Single-linkage clustering

Conceptual clustering

Constrained clustering

Brown clustering

Document Clustering

Clustering high-dimensional data

Citations (0)

Subspace Clustering, Ensemble Clustering, Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other?

Hans‐Peter Kriegel, Arthur Zimek (Ludwig-Maximilians-Universität München)

Though subspace clustering, ensemble clustering, alternative clustering, and multiview clustering are different approaches motivated by different problems and aiming at different goals, there are similar problems in these fields. Here we briefly survey these areas from the point of view of subspace clustering. Based on this survey, we try to identify problems where the different research areas could learn from each other.

Consensus clustering

Single-linkage clustering

Data stream clustering

Clustering high-dimensional data

Constrained clustering

k-medians clustering

Brown clustering

Citations (17)

Data Clustering: User’s Dilemma

Lecture notes in computer science (2007)