New Partition-based and Density-based approaches for improving clustering

Nabil El Malki

2021

Clustering is a branch of machine learning that consists in dividing a dataset into several groups, called clusters. Each cluster contains data with similar characteristics. Because clustering has so many applications, several approaches exist, and they differ in complexity and efficiency. In this thesis, we are mainly interested in centroid-based methods, more specifically k-means, and in density-based methods. In each approach, we have made contributions that address different problems.

Due to the growth of the amount of data produced by different sources (sensors, social networks, information systems...), it is necessary to design fast algorithms to manage this growth. One of the best-known problems in clustering is the k-means problem. It is considered NP-hard in the number of points and clusters. Lloyd's heuristic approximates the solution to this problem. It is one of the ten most used methods in data mining because of its algorithmic simplicity. Nevertheless, this iterative heuristic does not include an optimization strategy that avoids repetitive calculations. Versions based on geometric reasoning have partially addressed this problem. In this manuscript, we propose a strategy to reduce unnecessary computations in Lloyd's version and in the versions based on geometric reasoning. It consists mainly in identifying, by estimation, the stable points, i.e., the points that no longer contribute to improving the solution during the iterative process of k-means. Calculations related to stable points are thus avoided.

K-means requires users to supply the number of clusters K a priori. K must be as close as possible to the ground truth; otherwise, the resulting partition is of low quality or even unusable. We propose Kd-means, an algorithm based on a hierarchical approach. It consists in organizing the data hierarchically in a Kd-tree data structure and then merging sub-groups of points recursively, bottom-up, using new inter-group merging criteria that we have developed. These criteria guide the merging process toward an estimate of K close to the true value and produce clusters with shapes more complex than spheres. Through experimentation, Kd-means has clearly shown its superiority over its competitors in execution time, clustering quality and K estimation.

The challenges of the density-based approach are the high dimensionality of the points, the difficulty of separating low-density clusters from groups of outliers, and the separation of close clusters of the same density. To address these challenges, we have developed DECWA, a method based on a probabilistic approach. In DECWA, we propose 1) a strategy that divides a dataset into sub-groups, each of which follows its own probability law; 2) followed by another strategy that merges sub-groups with similar probability laws into final clusters. Experimentally, DECWA produces good-quality clustering in high-dimensional spaces compared to its competitors.
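As an illustration of the repetitive calculations mentioned above, the following is a minimal sketch of plain Lloyd's heuristic in pure Python. It is not the strategy proposed in the thesis: it only records, per iteration, how many points changed cluster, which shows that many points stop moving well before the last iteration while still being compared with every centroid. The function name `lloyd_kmeans`, its parameters and the `history` return value are assumptions made for this example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's heuristic (illustrative sketch, not the thesis's method).

    Returns (centroids, labels, history), where history[t] is the number of
    points whose cluster changed during iteration t.
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k input points
    labels = [-1] * len(points)
    history = []
    for _ in range(max_iter):
        changed = 0
        # Assignment step: every point is compared with every centroid,
        # including points whose cluster has not changed for many iterations.
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            if best != labels[i]:
                labels[i] = best
                changed += 1
        history.append(changed)
        if changed == 0:
            break  # no point moved: the partition is stable
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if the cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels, history


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, history = lloyd_kmeans(points, k=2)
```

Each entry of `history` counts the points that moved in one iteration. Points that did not move are the kind a stable-point strategy would try to detect, although the thesis identifies them by estimation rather than by this after-the-fact count.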
