The Parallelization and Optimization of K-means Algorithm Based on Spark

2020 
K-means clustering has two known weaknesses: the initial cluster centers are selected at random and the value of K is set empirically, and both choices affect the clustering results. To address them, a K-means clustering algorithm based on the canopy algorithm and the maximum-minimum distance method is proposed. The K value is generated by the canopy algorithm, which avoids setting it manually. The candidate set of cluster centers is selected using a weighted density method to reduce the impact of outliers on the clustering results. The center points are then selected by the maximum-minimum distance criterion to keep the clustering from falling into a local optimum. The algorithm is parallelized on Spark. Experimental results on UCI datasets show that the improved K-means algorithm not only improves clustering quality but also reduces the average number of iterations, and that it improves the efficiency and parallel computing capability of the algorithm.
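The two seeding ideas named in the abstract can be illustrated with a minimal single-machine sketch. It is not the paper's implementation: the thresholds `t1` and `t2`, the function names, and the choice of the first point as the starting center are assumptions made for illustration, and the weighted density step and the Spark parallelization are left out.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy(points, t1, t2):
    """Group points into canopies; the number of canopies is a candidate K.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (removal from the candidate list); t1 > t2 is assumed.
    """
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)          # assumption: take the next point in order
        members = [center]
        still_candidates = []
        for p in remaining:
            d = euclidean(center, p)
            if d < t1:
                members.append(p)          # p belongs to this canopy
            if d >= t2:
                still_candidates.append(p) # p may still start or join another canopy
        remaining = still_candidates
        canopies.append((center, members))
    return canopies


def max_min_centers(points, k):
    """Choose k initial centers by the maximum-minimum distance rule.

    Each new center is the point whose distance to its nearest
    already-chosen center is largest.
    """
    centers = [points[0]]                  # assumption: start from the first point
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centers))
        centers.append(nxt)
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0.1, 0), (0, 0.1),
            (5, 5), (5.1, 5), (5, 5.1),
            (10, 0), (10.1, 0), (10, 0.1)]
    k = len(canopy(data, t1=2.0, t2=1.0))
    print(k, max_min_centers(data, k))
```

The resulting centers would then be passed to an ordinary K-means loop as its starting points; in a Spark setting, the distance computations inside both functions are the natural candidates for distribution across partitions.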