A High-Performing Similarity Measure for Categorical Dataset with SF-Tree Clustering Algorithm
2018
Tasks such as clustering and classification assume
the existence of a similarity measure to assess the similarity
(or dissimilarity) of a pair of observations or clusters. The
key difference between most clustering methods is in their
similarity measures. This article proposes a new similarity measure
function called PWO (“Probability of the Weights between
Overlapped items”), which can be used to cluster categorical
datasets; proves that PWO is a metric; presents a framework
implementation to detect the best similarity value for different
datasets; and improves the F-tree clustering algorithm with a
semi-supervised method to refine the results. The experimental
evaluation on real categorical datasets (Mushrooms,
KrVskp, Congressional Voting, Soybean-Large, Soybean-Small,
Hepatitis, Zoo, Lenses, and Adult-Stretch) shows that PWO is
more effective at measuring the similarity between categorical
data than state-of-the-art algorithms; that clustering based on PWO
with a pre-defined number of clusters yields a good separation
of classes, with a high purity averaging 80% coverage of the real
classes; and that the overlap estimator perfectly estimates the value
of the overlap threshold using a small sample of around
5% of the dataset.
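
The purity figure quoted above can be made concrete with a short sketch. It uses the standard definition of cluster purity (the fraction of observations that belong to the majority class of their assigned cluster), which is assumed here to match the evaluation in the paper. The function name `purity` and the toy labels are hypothetical and are not taken from the paper, and the sketch does not implement PWO itself.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Return the fraction of observations in the majority class of their cluster.

    Standard purity definition; assumed, not confirmed, to be the one used
    in the paper's evaluation.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one observation is required")

    # Count how often each true class appears inside each cluster.
    per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        per_cluster.setdefault(cluster, Counter())[label] += 1

    # Each cluster contributes the size of its most frequent class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_ids)


# Toy example (hypothetical data): three clusters over ten observations.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
labels = ["e", "e", "p", "p", "p", "p", "e", "e", "e", "e"]
print(purity(clusters, labels))  # 8 of 10 observations match their cluster's majority class
```

Purity rises trivially as the number of clusters grows, which is one reason a result such as the one above is reported for a pre-defined number of clusters.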