A robust fuzzy approach for gene expression data clustering

Abstract In the big data era, clustering is one of the most popular data mining methods. Most clustering algorithms suffer from drawbacks such as the lack of automatic cluster number determination, poor clustering precision, inconsistent results across datasets, and dependence on parameters. A new fuzzy autonomous clustering solution, the Meskat-Mahmudul (MM) clustering algorithm, is proposed to address parameter-free automatic cluster number determination and clustering accuracy. The MM clustering algorithm determines the number of clusters using the Average Silhouette method in multivariate mixed-attribute datasets, including real-time gene expression datasets, and handles missing values, noise, and outliers. The MM Extended K-Means (MMK) clustering algorithm is an enhancement of the K-Means algorithm that provides automatic cluster discovery and runtime cluster placement. Several validation methods are used to evaluate the clusters and confirm optimal cluster partitioning. Several datasets are used to compare the performance of the proposed algorithms with that of other algorithms in terms of time complexity and clustering efficiency. Finally, the MM and MMK clustering algorithms are found to be superior to conventional algorithms.
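To illustrate the Average Silhouette idea mentioned in the abstract, the sketch below picks the number of clusters that maximizes the mean silhouette width. It is not the MM algorithm itself: it assumes plain K-Means with a deterministic farthest-point seeding, and the function names (`seed_centers`, `kmeans_labels`, `average_silhouette`, `choose_k`) are hypothetical names introduced for this example.

```python
import math

def seed_centers(points, k):
    # Deterministic farthest-point seeding (an assumption, chosen for reproducibility).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans_labels(points, k, iters=100):
    # Plain Lloyd iterations; points are tuples of numbers.
    centers = seed_centers(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster lost all its members.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members)) if members else centers[c]
            )
        if updated == centers:
            break
        centers = updated
    return labels

def average_silhouette(points, labels):
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treated as the worst score here
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # a singleton contributes s(i) = 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items() if lab != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_max=6):
    # Score every candidate k and return the one with the highest mean silhouette.
    scores = {k: average_silhouette(points, kmeans_labels(points, k)) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores
```

For three tight, well-separated groups of points, the mean silhouette peaks at k = 3, since merging two groups or splitting one both lower the score.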

Data stream clustering

Single-linkage clustering

k-medians clustering

Clustering high-dimensional data

10.21203/rs.3.rs-547452/v1

Citations (0)

GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases

Pattern Recognition Letters (2004)

Abdol Hamid Pilevar M. Sukumar

Data stream clustering

Clustering high-dimensional data

Single-linkage clustering

Constrained clustering

DBSCAN

10.1016/j.patrec.2004.09.052

Citations (60)

A robust fuzzy approach for gene expression data clustering

Soft Computing (2021)

Meskat Jahan Mahmudul Hasan

Data stream clustering

Single-linkage clustering

k-medians clustering

Clustering high-dimensional data

10.1007/s00500-021-06397-7

Citations (11)

Reclust: an efficient clustering algorithm for mixed data based on reclustering and cluster validation

Indonesian Journal of Electrical Engineering and Computer Science (2022)

M. Amala Jayanthi I. Elizabeth Shanthi

Clustering is a significant approach in data mining, which seeks to find groups or clusters of data. Both numeric and categorical features are frequently used to define the data in real-world applications. Several different clustering algorithms have been proposed for numerical and categorical datasets. In clustering algorithms, the quality of clustering results is evaluated using cluster validation. This paper proposes an efficient clustering algorithm for mixed numerical and categorical data using re-clustering and cluster validation. Initially, the mixed dataset is clustered with four traditional clustering algorithms: expectation-maximization (EM), hierarchical clustering (HC), k-means (KM), and self-organizing map (SOM). These four algorithms are validated, and the best algorithm is selected for re-clustering. Re-clustering is an iterative process for improving the quality of the cluster results: the incorrectly clustered data is iteratively re-clustered and evaluated based on the cluster validation. The performance of the proposed clustering method is evaluated with a real-time dataset in terms of purity, normalized mutual information, Rand index, precision, and recall. The experimental results show that the proposed reclust algorithm achieves better performance than other clustering algorithms.
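Two of the validation measures named in this abstract, purity and the Rand index, can be sketched in a few lines. This is a minimal illustration using the standard textbook definitions, not code from the paper; the function names are assumptions, and the remaining measures (normalized mutual information, precision, recall) are omitted.

```python
from collections import Counter
from itertools import combinations

def purity(true_labels, pred_labels):
    # For each predicted cluster, count its most frequent true class,
    # then divide the total of those counts by the number of items.
    clusters = {}
    for t, p in zip(true_labels, pred_labels):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(c.values()) for c in clusters.values()) / len(true_labels)

def rand_index(true_labels, pred_labels):
    # Fraction of item pairs on which the two labelings agree:
    # both place the pair together, or both place it apart.
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (pred_labels[i] == pred_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = ["a", "a", "a", "b", "b", "b"]
pred = [0, 0, 1, 1, 1, 1]
print(purity(truth, pred))      # 5 of 6 items match their cluster's majority class
print(rand_index(truth, pred))  # 10 of 15 pairs agree
```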

Single-linkage clustering

Data stream clustering

Categorical variable

Hierarchical clustering

k-medians clustering

Clustering high-dimensional data

10.11591/ijeecs.v29.i1.pp545-552

Citations (5)

Improved accelerating large data K-means clustering algorithm

Jisuanji gongcheng yu sheji (2015)

Han Ya

To deal with large-scale data clustering problems, an accelerated parallel K-means clustering method is presented, which first samples the data randomly and then uses the max-min distance method to carry out parallel K-means clustering. The sampling-based step avoids clustering becoming trapped in local solutions, and the max-min distance method makes the initial cluster centers tend toward the optimum. Results of a large number of experiments show that the proposed method is less affected by the initial cluster centers and improves clustering precision in both stand-alone and cluster environments. It also reduces the number of clustering iterations and the clustering time.
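The sample-then-seed step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the first seed is simply the first sampled point, the parallel K-means stage is left out, and the names `max_min_centers` and `sampled_initial_centers` are hypothetical.

```python
import math
import random

def max_min_centers(points, k):
    # Max-min distance seeding: each new center is the point whose distance
    # to its nearest already-chosen center is largest.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def sampled_initial_centers(points, k, sample_size, seed=0):
    # Draw a random sample first, then seed on the sample only, so the
    # seeding cost depends on the sample size rather than the full data size.
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    return max_min_centers(sample, k)
```

The returned centers would then serve as the starting point for an ordinary (or parallel) K-means run over the full dataset.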

Data stream clustering

Single-linkage clustering

k-medians clustering

Clustering high-dimensional data

Citations (0)

Correlation clustering based on genetic algorithm for documents clustering

Zhenya Zhang Hongmei Cheng Wanli Chen Shuguang Zhang Qiansheng Fang

Correlation clustering is an NP-hard problem, and techniques for solving it can be used to cluster a data set given a relation matrix over its data. In this paper, a genetic-algorithm-based approach to the correlation clustering problem, named GeneticCC, is presented. To estimate the performance of a clustering division, a data-correlation-based clustering precision is defined and its features are discussed. Experimental results show that, with clustering precision as the criterion, the clustering division constructed by GeneticCC for the UCI document data set performs better than the clustering divisions constructed by a SOM neural network.
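For readers unfamiliar with correlation clustering, the sketch below computes the standard disagreement count that such methods try to minimize. It is an assumption-laden illustration: the abstract does not define its clustering precision, so this is the textbook objective rather than the paper's criterion, and the function name and the signed-matrix encoding (positive for similar, negative for dissimilar) are choices made for this example.

```python
from itertools import combinations

def disagreements(relation, labels):
    # relation[i][j] > 0 means items i and j are marked similar,
    # relation[i][j] < 0 means they are marked dissimilar.
    cost = 0
    for i, j in combinations(range(len(labels)), 2):
        together = labels[i] == labels[j]
        if relation[i][j] > 0 and not together:
            cost += 1  # similar pair split across clusters
        elif relation[i][j] < 0 and together:
            cost += 1  # dissimilar pair placed in one cluster
    return cost

relation = [
    [0, 1, -1, -1],
    [1, 0, -1, -1],
    [-1, -1, 0, 1],
    [-1, -1, 1, 0],
]
print(disagreements(relation, [0, 0, 1, 1]))  # partition matching the matrix
print(disagreements(relation, [0, 0, 0, 0]))  # everything in one cluster
```

A genetic algorithm could use a cost of this kind, or a precision measure derived from it, as its fitness function when evolving candidate label vectors.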

Single-linkage clustering

Data stream clustering

Clustering high-dimensional data

k-medians clustering

10.1109/cec.2008.4631230

Citations (11)

An Effective High Dimensional Categorical Data Clustering Method Research

Microelectronics & Computer (2011)

Deyu Li

With the increasing size of data sets, improving the efficiency of the K-modes and fuzzy K-modes clustering algorithms is becoming a critical issue. To improve efficiency, a clustering method based on divide-and-conquer is proposed. Rather than clustering all data at once, this method divides the data set into several subsets and clusters each subset at the same time; the fused results of the subset clusterings form the final clustering result. The results show that, in most cases, clustering efficiency is greatly increased compared with the traditional clustering method.
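The building blocks that K-modes applies to each subset can be sketched briefly: a simple-matching dissimilarity between categorical records, a per-attribute mode as the cluster representative, and nearest-mode assignment. This is only an illustration of standard K-modes pieces; the abstract does not specify how the data is partitioned or how subset results are fused, so neither step is shown, and the function names are assumptions.

```python
from collections import Counter

def matching_dissimilarity(a, b):
    # Number of attributes on which two categorical records differ.
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(records):
    # Most frequent category in each attribute column.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))

def assign(records, modes):
    # Index of the nearest mode for every record.
    return [
        min(range(len(modes)), key=lambda m: matching_dissimilarity(r, modes[m]))
        for r in records
    ]
```

In a divide-and-conquer setting, these functions would be applied to each subset independently, which is what makes the subsets suitable for concurrent processing.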

Single-linkage clustering

Data stream clustering

Clustering high-dimensional data

Categorical variable

k-medians clustering

Citations (0)

A H-K clustering algorithm based on ensemble learning

Ying He Yanfeng Shang Jian Wang Liang-xi Qin Wenfei Wang

The traditional H-K clustering algorithm can address the randomness and a priori selection of the initial centers in the K-means clustering algorithm. However, because of its high computational complexity, it suffers from the curse of dimensionality when applied to high-dimensional datasets. Clustering ensembles apply ensemble learning to obtain a better clustering result by learning from the merged output of multiple clusterings. The objective of this paper is to improve the performance of the traditional H-K clustering algorithm on high-dimensional datasets. Using ensemble learning, a new clustering algorithm named EPCAHK (Ensemble Principal Component Analysis Hierarchical K-means Clustering algorithm) is proposed. In the EPCAHK algorithm, the high-dimensional dataset is first mapped into a low-dimensional space using PCA. Subsequently, the clustering results of the hierarchical stage, which provide initial information (e.g., the cluster number or the initial cluster centers), are integrated using the min-transitive closure method. Finally, the K-means clustering algorithm produces the final clustering result from these ensemble results. The experimental results indicate that, compared with the traditional H-K clustering algorithm, EPCAHK obtains a better clustering result. The average accuracy of the clustering results can reach 90% or above, and stability on large high-dimensional datasets is also improved.

Ensemble Learning

K-Means Clustering

10.1049/cp.2013.1976

Citations (10)

On K-means data clustering algorithm with genetic algorithm

Shruti Kapil Meenu Chawla Mohd Dilshad Ansari

Clustering has been used in various disciplines, such as software engineering, statistics, data mining, image analysis, machine learning, Web cluster engines, and text mining, to deduce groups in large volumes of data. The notion behind clustering is to assign objects to clusters in such a way that objects within one cluster are more similar to one another than to objects in other clusters. Various clustering algorithms are available, namely k-means, Cobweb, DBSCAN, farthest-first, and x-means clustering, but K-means is, on the whole, the most widely used algorithm for the unsupervised clustering problem. In this paper, k-means clustering is optimised using a genetic algorithm so that the problems of k-means can be overcome. The outcomes of k-means clustering and genetic k-means clustering are evaluated and compared; the results show that K-means combined with a genetic algorithm offers improvements in this research domain.

Single-linkage clustering

Data stream clustering

Clustering high-dimensional data

k-medians clustering

DBSCAN

10.1109/pdgc.2016.7913145

Citations (102)