An Effective High Dimensional Categorical Data Clustering Method Research
Abstract:
As data sets grow, improving the efficiency of the K-modes and fuzzy K-modes clustering algorithms is becoming a critical issue. To improve efficiency, a clustering method based on the divide-and-conquer strategy is proposed. Rather than clustering all the data in one pass, the method divides the data set into several subsets and clusters the subsets simultaneously; the fused results of the subset clusterings form the final clustering result. The results show that, in most cases, clustering efficiency increases greatly compared with the traditional clustering method.
Keywords:
Single-linkage clustering
Data stream clustering
Clustering high-dimensional data
Categorical variable
k-medians clustering
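The divide-and-conquer idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how subsets are formed or fused, so the random split, the farthest-first initialisation, the fusion step (running k-modes again on the subset modes), and all function names here are assumptions made for illustration.

```python
import random
from collections import Counter

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest(row, modes):
    """Index of the mode closest to row."""
    return min(range(len(modes)), key=lambda i: hamming(row, modes[i]))

def k_modes(rows, k, iters=20):
    """Plain k-modes on categorical rows; returns the list of cluster modes."""
    k = min(k, len(rows))
    if k == 0:
        return []
    # Farthest-first initialisation (an assumption; keeps the sketch deterministic).
    modes = [tuple(rows[0])]
    while len(modes) < k:
        far = max(rows, key=lambda r: min(hamming(r, m) for m in modes))
        modes.append(tuple(far))
    for _ in range(iters):
        groups = [[] for _ in modes]
        for r in rows:
            groups[nearest(r, modes)].append(r)
        # New mode = most frequent value per attribute; empty groups keep their mode.
        new_modes = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*g)) if g else m
            for g, m in zip(groups, modes)
        ]
        if new_modes == modes:
            break
        modes = new_modes
    return modes

def divide_and_conquer_k_modes(rows, k, n_subsets=4, seed=0):
    """Cluster each subset separately, then fuse the subset modes."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    subsets = [shuffled[i::n_subsets] for i in range(n_subsets)]
    # The subsets are independent, so these calls could run in parallel.
    local_modes = [m for s in subsets for m in k_modes(s, k)]
    # Fusion step (assumed): cluster the subset modes themselves.
    final_modes = k_modes(local_modes, k)
    labels = [nearest(r, final_modes) for r in rows]
    return final_modes, labels
```

The potential speed-up comes from each k-modes run seeing only a fraction of the rows and from the subset runs being independent of one another; the fusion step only has to process `n_subsets * k` modes.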
This paper presents an improved hierarchical K-means clustering algorithm that exploits the hierarchical structure of the space, in order to address the poor results that traditional K-means clustering produces when the number of clusters is chosen randomly before clustering. After an initial K-means clustering, the algorithm uses the result to decide whether to re-cluster at a finer level. Repeating this process produces a hierarchical K-means clustering tree, and the number of clusters is selected automatically from this tree structure. Simulation results on UCI datasets demonstrate that the hierarchical K-means clustering model obtains better clustering results than traditional K-means clustering.
Single-linkage clustering
Hierarchical clustering
Brown clustering
Data stream clustering
Citations (13)
Document clustering is an integral and important part of text mining. There are two types of clustering: hard clustering and soft clustering. In hard clustering, a data item belongs to only one cluster, whereas in soft clustering a data point may fall into more than one cluster. Soft clustering thus leads to fuzzy clustering, in which each data point is associated with a membership function that expresses the degree to which it belongs to each cluster. Accuracy is desired in information retrieval, and it can be achieved by fuzzy clustering. In the work presented here, a fuzzy approach to text classification is used to assign documents to appropriate clusters using the Fuzzy C-Means (FCM) clustering algorithm. The Enron email dataset is used for the experiments. Using the FCM clustering algorithm, emails are classified into different clusters. The results are compared with the output produced by the k-means clustering algorithm. The comparative study showed that the fuzzy clusters are more appropriate than hard clusters.
Single-linkage clustering
FLAME clustering
Clustering high-dimensional data
Document Clustering
k-medians clustering
Citations (15)
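The membership function described in the preceding abstract can be made concrete with a small sketch of the standard fuzzy c-means updates. This is not the cited paper's code: it assumes documents have already been turned into numeric vectors (for example TF-IDF weights), uses Euclidean distance, and takes the initial centers as an argument; the function names are hypothetical.

```python
import math

def memberships(points, centers, m=2.0, eps=1e-9):
    """Standard FCM membership: u[j][i] is the degree to which point j belongs to cluster i."""
    u = []
    for x in points:
        # Clamp distances so a point sitting exactly on a center does not divide by zero.
        d = [max(math.dist(x, c), eps) for c in centers]
        u.append([1.0 / sum((di / dk) ** (2.0 / (m - 1.0)) for dk in d) for di in d])
    return u

def fcm(points, centers, m=2.0, iters=50):
    """Alternate center and membership updates for a fixed number of iterations."""
    u = memberships(points, centers, m)
    dims = range(len(points[0]))
    for _ in range(iters):
        new_centers = []
        for i in range(len(centers)):
            w = [row[i] ** m for row in u]  # fuzzified weights for cluster i
            total = sum(w)
            new_centers.append([sum(wj * x[d] for wj, x in zip(w, points)) / total for d in dims])
        centers = new_centers
        u = memberships(points, centers, m)
    return centers, u
```

Each row of `u` sums to 1, so a point near the boundary between two clusters receives comparable membership in both, whereas k-means would force it into exactly one.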
Clustering documents gives the user a good overall view of the information the documents contain. Most classical clustering algorithms assign each data point to exactly one cluster, thus forming a crisp partition of the given data, whereas fuzzy clustering allows degrees of membership, so that a data point can belong to several clusters. In this system, documents are clustered using the fuzzy c-means (FCM) clustering algorithm. FCM is one of the best-known unsupervised clustering techniques. However, the FCM algorithm requires the user to predefine the number of clusters, and different numbers of clusters correspond to different fuzzy partitions, so the clustering result needs to be validated. The PBM index and F-measure are used to assess cluster validity.
Single-linkage clustering
FLAME clustering
Complete-linkage clustering
k-medians clustering
Consensus clustering
Citations (10)
Cluster analysis refers to the process of grouping a collection of physical or abstract objects into multiple classes of similar objects. Determining the optimal number of classes for a data set, that is, whether the data set can be effectively partitioned, is the key to the clustering problem. Cluster validity study is the process of establishing clustering effectiveness indicators, evaluating clustering quality, and determining the optimal number of clusters. A validity function for the fuzzy C-means (FCM) clustering algorithm is proposed, based on the ratio of intra-class compactness to inter-class separation, whose minimum represents the best clustering. The proposed validity function is then compared with well-known typical validity functions in simulation experiments on clustering performance. Three groups of data sets are used for FCM clustering: three classical data sets, two artificial data sets, and six real data sets from the UCI database. The simulation results show that the proposed validity function can effectively partition the data sets.
Single-linkage clustering
Data stream clustering
Constrained clustering
k-medians clustering
Citations (35)
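A compactness-to-separation validity function of the kind described in the preceding abstract can be illustrated with the Xie-Beni index, a well-known index of this form whose minimum also indicates the best partition. The cited abstract does not give its own formula, so this sketch should not be read as the proposed function; the function name and argument layout are assumptions.

```python
import math

def xie_beni(points, centers, u, m=2.0):
    """Xie-Beni index: fuzzy intra-cluster compactness divided by
    (number of points * smallest squared distance between two centers).
    u[j][i] is the membership of point j in cluster i. Lower is better."""
    compactness = sum(
        (u_ji ** m) * math.dist(x, c) ** 2
        for x, row in zip(points, u)
        for u_ji, c in zip(row, centers)
    )
    separation = min(
        math.dist(a, b) ** 2
        for i, a in enumerate(centers)
        for b in centers[i + 1:]
    )
    return compactness / (len(points) * separation)
```

To choose the number of clusters with such an index, one would typically run FCM for each candidate number of clusters and keep the value that minimises the index.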
The main defects of the traditional FCM algorithm are that it is sensitive to isolated data points and requires the number of clusters to be known in advance. This paper presents a fuzzy clustering algorithm, NSFCM, and applies it to text mining. The algorithm adds a weight to the membership of each data point in order to reduce the effect on the initial cluster centers. It uses average information entropy to find the number of clusters and a density function algorithm to find the initial cluster centers. Experiments show that both the precision and the efficiency of NSFCM clustering are higher than those of FCM.
Single-linkage clustering
k-medians clustering
Citations (7)
Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups based on the principle of maximizing intra-class similarity and minimizing inter-class similarity. This paper analyzes three major clustering algorithms, K-Means, hierarchical clustering, and density-based clustering, and compares their performance in terms of how correctly each algorithm builds clusters that match the classes. The performance of the three techniques is presented and compared using the clustering tool WEKA.
Single-linkage clustering
Data stream clustering
Hierarchical clustering
Similarity (geometry)
Consensus clustering
Citations (42)
The correlation clustering problem is NP-hard, and techniques for solving it can be used to cluster a data set given a relation matrix over its data. In this paper, a genetic-algorithm-based approach to the correlation clustering problem, named GeneticCC, is presented. To estimate the performance of a clustering division, a clustering precision based on data correlation is defined and its features are discussed. Experimental results show that, with clustering precision as the criterion, the clustering divisions that GeneticCC constructs for a UCI document data set perform better than the clustering divisions constructed by an SOM neural network.
Single-linkage clustering
Data stream clustering
Clustering high-dimensional data
k-medians clustering
Citations (11)
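For readers unfamiliar with correlation clustering, the quantity being optimised can be sketched as follows. This shows the standard disagreement count over a signed relation matrix, not the clustering precision defined in the cited paper (which the abstract does not spell out) and not the GeneticCC search itself; the encoding of the matrix (positive for "similar", negative for "dissimilar") is an assumption.

```python
def disagreements(labels, relation):
    """Count the pairs on which a clustering disagrees with a signed relation matrix.

    relation[i][j] > 0 means items i and j should share a cluster,
    relation[i][j] < 0 means they should be separated; 0 means no preference.
    """
    n = len(labels)
    cost = 0
    for i in range(n):
        for j in range(i + 1, n):
            same = labels[i] == labels[j]
            if relation[i][j] > 0 and not same:
                cost += 1  # similar pair split across clusters
            elif relation[i][j] < 0 and same:
                cost += 1  # dissimilar pair placed together
    return cost
```

A search procedure such as a genetic algorithm could use a count like this (or its complement) as the fitness of a candidate label vector; note that the number of clusters is not fixed in advance.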
To improve efficiency, we propose a distributed clustering algorithm for large data sets. Instead of clustering all the data at once, the data is randomly divided into several subsets, and all the subsets are clustered at the same time. Finally, the resulting clusters are merged. Experimental results show that, most of the time, the result is the same as that of the traditional clustering algorithm, while clustering speed improves greatly.
Single-linkage clustering
Data stream clustering
k-medians clustering
Clustering high-dimensional data
Citations (0)
Data set
Clustering high-dimensional data
Consensus clustering
Citations (3)
Cluster analysis is one of the most important research topics in unsupervised pattern recognition. Fuzzy clustering describes the uncertainty in assigning samples to categories, so it can reflect the real world more objectively. Traditional fuzzy clustering algorithms cannot determine the optimal number of clusters automatically. In this paper, by adopting the idea of hierarchical clustering, a new adaptive fuzzy c-means clustering algorithm, the A-FCM algorithm, is proposed that determines the optimal number of clusters automatically and efficiently. Numerical experiments show that the method performs better than other adaptive fuzzy clustering algorithms that determine the number of clusters through various validity functions.
FLAME clustering
Single-linkage clustering
Hierarchical clustering
Citations (2)