A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm

Mahmoud A. Mahdi, Samir E. AbdelRahman, Reem Bahgat

2018

Tasks such as clustering and classification assume the existence of a similarity measure to assess the similarity (or dissimilarity) of a pair of observations or clusters. The key difference between most clustering methods lies in their similarity measures. This article proposes a new similarity measure called PWO ("Probability of the Weights between Overlapped items") for clustering categorical datasets; proves that PWO is a metric; presents a framework implementation to detect the best similarity value for different datasets; and improves the F-tree clustering algorithm with a semi-supervised method to refine the results. The experimental evaluation on real categorical datasets (Mushrooms, KrVskp, Congressional Voting, Soybean-Large, Soybean-Small, Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO is more effective at measuring the similarity between categorical data than state-of-the-art algorithms; that clustering based on PWO with a pre-defined number of clusters yields a good separation of classes, with a high purity averaging 80% coverage of the real classes; and that the overlap estimator perfectly estimates the value of the overlap threshold from a small sample of around 5% of the dataset.
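The abstract reports cluster purity but does not define it. As a minimal sketch, assuming the standard definition (the fraction of observations that belong to the majority class of their assigned cluster), purity could be computed as follows. The function name `purity` and the toy labels are illustrative assumptions, not taken from the article, and this is not the article's PWO measure or its evaluation code.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, class_labels):
    """Standard clustering purity (assumed definition, not from the article).

    Returns the fraction of observations whose true class is the
    majority class of the cluster they were assigned to.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have the same length")
    if not cluster_ids:
        raise ValueError("purity is undefined for an empty dataset")

    # Count how often each true class occurs inside each cluster.
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1

    # Sum the size of the majority class in every cluster.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(cluster_ids)


# Hypothetical toy data: two clusters over six observations.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["edible", "edible", "poisonous", "poisonous", "poisonous", "edible"]
print(purity(clusters, labels))
```

In this toy example each cluster has two of its three members in the majority class, so four of the six observations count toward purity.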
