    Clustering - What Both Theoreticians and Practitioners are Doing Wrong
    Abstract:
    Unsupervised learning is widely recognized as one of the most important challenges facing machine learning nowadays. However, in spite of hundreds of papers on the topic being published every year, current theoretical understanding and practical implementations of such tasks, in particular of clustering, are very rudimentary. This note focuses on clustering. I claim that the most significant challenge for clustering is model selection. In contrast with other common computational tasks, for clustering, different algorithms often yield drastically different outcomes. Therefore, the choice of a clustering algorithm and its parameters (like the number of clusters) may play a crucial role in the usefulness of an output clustering solution. However, there currently exists no methodical guidance for clustering tool selection for a given clustering task. Practitioners pick the algorithms they use without awareness of the implications of their choices, and the vast majority of theory-of-clustering papers focus on providing savings in the resources needed to solve optimization problems that arise from picking some concrete clustering objective. These savings pale in comparison to the costs of mismatch between those objectives and the intended use of clustering results. I argue the severity of this problem and describe some recent proposals aiming to address this crucial lacuna.
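The claim that different algorithms can yield drastically different outcomes can be illustrated with a minimal sketch. The toy data set (two parallel chains of points), the function names, and the fixed k-means initialization below are assumptions made for illustration; none of them comes from the paper. The sketch compares Lloyd's k-means with single-linkage clustering, both asked for two clusters.

```python
from itertools import combinations


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, init_centers, max_iter=100):
    """Lloyd's algorithm with caller-supplied initial centers (deterministic)."""
    centers = [tuple(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def single_linkage(points, k):
    """Single-linkage clustering: merge closest pairs until k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: sq_dist(points[e[0]], points[e[1]]),
    )
    components = len(points)
    for i, j in edges:
        if components == k:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    # Relabel cluster roots as 0, 1, ... in order of first appearance.
    relabel = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(len(points))]


def pair_agreement(a, b):
    """Fraction of point pairs on which two clusterings agree (Rand index)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


# Hypothetical toy data: two horizontal chains, 2 units apart,
# with neighbouring points inside a chain 1 unit apart.
points = [(x, 0) for x in range(10)] + [(x, 2) for x in range(10)]

km_labels = kmeans(points, init_centers=[points[0], points[-1]])
sl_labels = single_linkage(points, k=2)
agreement = pair_agreement(km_labels, sl_labels)

print("k-means:       ", km_labels)
print("single linkage:", sl_labels)
print("pair agreement:", round(agreement, 3))
```

On this data, single linkage recovers the two chains, while k-means (with this initialization) cuts both chains in half, because that split has a lower sum of squared distances. Neither output is wrong with respect to its own objective; which one is useful depends on the intended use of the clustering.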
    Keywords:
    Conceptual clustering
    Consensus clustering
    Constrained clustering
    Implementation
    Partitioning a set of objects into homogeneous clusters is a fundamental operation in data mining. The operation is needed in a number of data mining tasks. Clustering, or data grouping, is a key technique of data mining. It is an unsupervised learning task in which one seeks to identify a finite set of categories, termed clusters, to describe the data. The grouping of data into clusters is based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity. The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how does one decide what constitutes a good clustering? This paper deals with the study of various clustering algorithms used in data mining, and it focuses on the basics, requirements, classification, problems, and application areas of clustering algorithms.
    Constrained clustering
    Data stream clustering
    Single-linkage clustering
    Consensus clustering
    Clustering high-dimensional data
    Similarity (geometry)
    Data set
    Citations (68)
    Clustering, an unsupervised learning process, is a challenging problem. Clustering result quality improves the overall structure. In this article, we propose an incremental stream of hierarchical clustering that improves the efficiency and accuracy, and reduces the time consumption, of a text categorization algorithm by forming exact sub-clusters. We propose a new method called multilevel clustering, which is a combination of a supervised and an unsupervised technique for forming the clusters. In this method we form four levels of clustering. The proposed work uses an existing clustering algorithm. We develop and discuss algorithms for the multilevel clustering method to achieve the best clustering results.
    Data stream clustering
    Single-linkage clustering
    Conceptual clustering
    Hierarchical clustering
    Brown clustering
    Constrained clustering
    Consensus clustering
    Clustering high-dimensional data
    Citations (3)
    High-dimensional data, characterized by a very large number of features, introduces new issues for clustering. The so-called 'high dimensionality', a term coined initially to describe the general increase in time complexity of many computational problems, causes general-purpose clustering algorithms to perform poorly. Accordingly, several works have focused on introducing new techniques and clustering algorithms for handling high-dimensional data. Common to all clustering algorithms is the fact that they need some underlying measure of similarity among data objects. However, the existing clustering algorithms still have some open research issues. In this review, we provide a summary of the effects of high-dimensional data spaces and their implications for various clustering algorithms. We also present a detailed overview of several types of clustering algorithms: subspace methods, model-based clustering, density-based clustering methods, partition-based clustering methods, etc., including a more detailed description of recent work and its advantages and disadvantages for solving the high-dimensional data problem. The scope of future work to extend the present clustering methods and algorithms is also discussed at the end of the work.
    Data stream clustering
    Clustering high-dimensional data
    Constrained clustering
    Consensus clustering
    Citations (0)
    Unsupervised classification, or clustering, is an important data analysis technique in demand in various fields including machine learning, data mining, pattern recognition, image analysis and bioinformatics. Recently a large number of studies have attempted to improve clustering by combining multiple clustering solutions into a single consolidated clustering ensemble that has the best performance among the given clustering solutions. However, different clustering ensembles behave differently on data of various characteristics. In this paper, we propose a novel approach to data clustering by constructing a clustering ensemble iteratively based on partitions generated on training subsets sampled from the original dataset. To yield a robust clustering ensemble, our approach employs a hybrid sampling scheme inspired by both boosting and bagging techniques originally proposed for supervised learning. Our approach has been evaluated on synthetic data and real-world motion trajectory data sets, and experimental results demonstrate that it yields satisfactory performance for a variety of clustering tasks.
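The general idea of combining several clustering solutions into one consolidated result can be sketched with a co-association matrix. This is a generic consensus technique swapped in for illustration, not the boosting-and-bagging-inspired sampling scheme the abstract describes; the function names, the example partitions, and the 0.5 threshold are all assumptions.

```python
def co_association(partitions):
    """Entry [i][j] is the fraction of partitions that put points i and j together."""
    n = len(partitions[0])
    m = len(partitions)
    return [
        [sum(p[i] == p[j] for p in partitions) / m for j in range(n)]
        for i in range(n)
    ]


def consensus(partitions, threshold=0.5):
    """Link points co-clustered in more than `threshold` of the partitions,
    then return the connected components as the consensus clusters."""
    co = co_association(partitions)
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue  # already assigned to an earlier component
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


# Three hypothetical base clusterings of six points. Label values are
# arbitrary per partition; only the "same cluster or not" relation matters.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0],  # point 2 grouped with the right-hand points
    [2, 2, 2, 0, 0, 1],  # point 5 split off on its own
]

consensus_labels = consensus(partitions)
print(consensus_labels)
```

Working with pairwise co-occurrence avoids having to match cluster labels across partitions, which is why the second and third partitions can use different label values. The majority threshold lets the ensemble outvote the disagreements of individual partitions.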
    Consensus clustering
    Ensemble Learning
    Data stream clustering
    Conceptual clustering
    Boosting
    Single-linkage clustering
    Clustering high-dimensional data
    We examine whether the quality of different clustering algorithms can be compared by a general, scientifically sound procedure which is independent of particular clustering algorithms. We argue that the major obstacle is the difficulty in evaluating a clustering algorithm without taking into account the context: why does the user cluster his data in the first place, and what does he want to do with the clustering afterwards? We argue that clustering should not be treated as an application-independent mathematical problem, but should always be studied in the context of its end-use. Different techniques to evaluate clustering algorithms have to be developed for different uses of clustering. To simplify this procedure we argue that it will be useful to build a taxonomy of clustering problems to identify clustering applications which can be treated in a unified way and that such an effort will be more fruitful than attempting the impossible--developing optimal domain-independent clustering algorithms or even classifying clustering algorithms in terms of how they work.
    Constrained clustering
    Data stream clustering
    Conceptual clustering
    Brown clustering
    Clustering high-dimensional data
    Consensus clustering
    Citations (69)
    Post-clustering algorithms, which cluster the results of Web search engines, have several requirements that differ from those of conventional clustering algorithms. In this paper, we propose a new post-clustering algorithm that satisfies as many of those requirements as possible. The proposed Concept ART combines the concept vector, which has several advantages in document clustering, with Fuzzy ART, which is known as a real-time clustering algorithm. Moreover, we show that it is applicable to general-purpose clustering as well as post-clustering.
    Data stream clustering
    Single-linkage clustering
    Conceptual clustering
    Constrained clustering
    Brown clustering
    Document Clustering
    Clustering high-dimensional data
    Citations (0)
    Though subspace clustering, ensemble clustering, alternative clustering, and multiview clustering are different approaches motivated by different problems and aiming at different goals, there are similar problems in these fields. Here we briefly survey these areas from the point of view of subspace clustering. Based on this survey, we try to identify problems where the different research areas could probably learn from each other.
    Consensus clustering
    Single-linkage clustering
    Data stream clustering
    Clustering high-dimensional data
    Constrained clustering
    k-medians clustering
    Brown clustering
    Citations (17)
    Constrained clustering
    Consensus clustering
    Single-linkage clustering
    Clustering high-dimensional data
    Data stream clustering
    Complete-linkage clustering
    Citations (14)
    Single-linkage clustering
    Consensus clustering
    Similarity (geometry)
    Constrained clustering
    Robustness
    Data stream clustering
    Clustering high-dimensional data
    Complete-linkage clustering
    Citations (8)