Comparison of different similarity measures in hierarchical clustering

2021 
The management of datasets containing heterogeneous types of data is a crucial point in the context of precision medicine, where genetic, environmental, and life-style information of each individual has to be analyzed simultaneously. Clustering represents a powerful method, used in data mining, for extracting new useful knowledge from unlabeled datasets. Clustering methods are essentially distance-based, since they measure the similarity (or the distance) between two elements or one element and the cluster centroid. However, the selection of the distance metric is not a trivial task: it could influence the clustering results and, thus, the extracted information. In this study we analyze the impact of four similarity measures (Manhattan or L1 distance, Euclidean or L2 distance, Chebyshev or L∞ distance and Gower distance) on the clustering results obtained for datasets containing different types of variables. We applied hierarchical clustering combined with an automatic cut point selection method to six datasets publicly available on the UCI Repository. Four different clusterizations were obtained for every dataset (one for each distance) and were analyzed in terms of number of clusters, number of elements in each cluster, and cluster centroids. Our results showed that changing the distance metric produces substantial modifications in the obtained clusters. This behavior is particularly evident for datasets containing heterogeneous variables. Thus, the choice of the distance measure should not be done a-priori but evaluated according to the set of data to be analyzed and the task to be accomplished.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    17
    References
    0
    Citations
    NaN
    KQI
    []