The Parallelization and Optimization of K-means Algorithm Based on Spark

Zitian Wang, Aibo Xu, Zipeng Zhang, Chunzhi Wang, Aijun Liu, Xiang Hu

2020

The K-means clustering algorithm has two known deficiencies: the random selection of initial cluster centers and the empirical choice of the K value both affect the clustering results. To address them, a K-means clustering algorithm based on the Canopy algorithm and the maximum-minimum distance method is proposed. The K value is generated by the Canopy algorithm, which avoids setting it manually. The candidate set of cluster centers is selected with a weighted density method to reduce the impact of outliers on the clustering results. The center points are then chosen by the maximum-minimum distance criterion to keep the clustering result from falling into a local optimum. The algorithm is parallelized on Spark. Experimental results on UCI datasets show that the improved K-means algorithm not only improves clustering quality but also reduces the average number of iterations, and that it effectively improves the efficiency and parallel computing capability of the algorithm.
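
The abstract does not give the exact procedure, so the following is only a minimal single-machine sketch of two of the ideas it names: a simplified Canopy pass to estimate K, and maximum-minimum distance selection of initial centers. The function names, the single threshold `t2`, the choice of the first point as the starting center, and the toy data are all assumptions made for illustration. The weighted density step and the Spark parallelization are not shown.

```python
import math


def canopy_centers(points, t2):
    """Simplified Canopy pass (assumed variant with a single threshold t2).

    Repeatedly take the first remaining point as a canopy center and
    discard every remaining point that lies within distance t2 of it.
    The number of canopies found serves as an estimate of K.
    """
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def max_min_centers(points, k):
    """Maximum-minimum distance seeding.

    Start from the first point (an arbitrary choice made here), then
    repeatedly add the point whose distance to its nearest already
    chosen center is largest.
    """
    centers = [points[0]]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(math.dist(p, c) for c in centers),
        )
        centers.append(farthest)
    return centers


# Toy 2-D data with three visually separated groups (hypothetical).
points = [
    (0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
    (5.0, 5.0), (5.1, 4.9),
    (9.0, 0.0), (9.2, 0.1),
]

k = len(canopy_centers(points, t2=2.0))  # K estimated from the canopies
seeds = max_min_centers(points, k)       # initial centers for K-means
```

On this toy data the Canopy pass finds three canopies, and the seeding picks one point from each group. In a Spark setting, the per-point distance computations inside both functions would be the natural candidates for distribution across partitions, though how the authors actually do this is not stated in the abstract.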
