Hierarchical Clustering Algorithm for Binary Data Based on Cosine Similarity

2018 
Clustering binary data is a challenging problem in data mining and machine learning. Although several methods have been proposed for clustering binary data, they do not balance clustering quality and efficiency effectively. To this end, we propose a hierarchical clustering algorithm for binary data based on cosine similarity (HABOC). First, we assess the similarity between data objects with binary attributes using Cosine Similarity (CS). Then, we define the Cosine Similarity of a Set (CSS) to compute the similarity of a set containing multiple objects. Based on CSS, we propose the Cosine Feature Vector of a Set (CFVS) and the additivity of CFVS, which allow us to compress data and merge two clusters directly. We also adopt a hierarchical clustering method in order to avoid sensitivity to the order of data objects and to algorithm parameters. Experimental results on several UCI datasets demonstrate that HABOC outperforms existing binary data clustering algorithms.
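As a minimal illustration of the two ideas named above, the sketch below computes the standard cosine similarity of two binary vectors (shared 1-attributes divided by the square root of the product of the two 1-counts) and shows what an additive cluster summary can look like. The abstract does not give the definitions of CSS or CFVS, so `summarize` and `merge` are hypothetical stand-ins, assumed here to be an object count plus per-attribute sums; they are not the paper's CFVS.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length 0/1 vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    shared = sum(a & b for a, b in zip(x, y))  # attributes set to 1 in both
    ones_x, ones_y = sum(x), sum(y)
    if ones_x == 0 or ones_y == 0:
        return 0.0  # convention assumed here for all-zero vectors
    return shared / math.sqrt(ones_x * ones_y)


def summarize(objects):
    """Hypothetical additive summary (NOT the paper's CFVS):
    the object count and the per-attribute sums of a non-empty cluster."""
    return len(objects), [sum(column) for column in zip(*objects)]


def merge(summary_a, summary_b):
    """Merge two summaries without revisiting the original objects."""
    count_a, sums_a = summary_a
    count_b, sums_b = summary_b
    return count_a + count_b, [a + b for a, b in zip(sums_a, sums_b)]


cluster_a = [[1, 1, 0, 0], [1, 0, 1, 0]]
cluster_b = [[0, 1, 1, 1]]
print(cosine_similarity(cluster_a[0], cluster_a[1]))
print(merge(summarize(cluster_a), summarize(cluster_b)))
```

The point of an additive summary is that merging two clusters costs time proportional to the number of attributes rather than the number of objects, which is the kind of saving the abstract attributes to the additivity of CFVS.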