K-Means Parallel Acceleration for Sparse Data Dimensions on Flink

2019 
The K-means algorithm is a clustering algorithm that is widely used in various applications, and its running time increases dramatically as the data size grows. When the volume of data exceeds what a single machine can handle, the algorithm must be run in parallel on a distributed computing framework. In general, during parallel execution, tasks differ in running time because of data skew, and the progress of the entire job is determined by the task with the longest running time. In this paper, we propose an optimal data partitioning method for applying the K-means algorithm to datasets with sparse dimensions, in order to eliminate the data skew problem and further accelerate the parallel execution of the algorithm. Experimental evaluation on large-scale text datasets demonstrates the effectiveness of our partitioning approach on Flink.
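To make the skew problem concrete, the following is a minimal sketch and not the partitioning method proposed in the paper, which the abstract does not describe. It assumes that the cost of assigning a sparse point to a centroid is roughly proportional to its number of non-zero dimensions, and it swaps in a generic greedy longest-processing-time heuristic to balance that cost across partitions. All names (`nnz_cost`, `round_robin_partition`, `balanced_partition`, `makespan`) are hypothetical, and the code is plain Python rather than Flink API code.

```python
import heapq


def nnz_cost(vector):
    """Assumed cost model: work per point grows with its non-zero dimensions."""
    return len(vector)


def round_robin_partition(points, num_partitions):
    """Baseline: assign points to partitions in arrival order, ignoring cost."""
    parts = [[] for _ in range(num_partitions)]
    for i, point in enumerate(points):
        parts[i % num_partitions].append(point)
    return parts


def balanced_partition(points, num_partitions):
    """Greedy heuristic: place the costliest remaining point on the lightest partition."""
    heap = [(0, pid) for pid in range(num_partitions)]  # (current load, partition id)
    heapq.heapify(heap)
    parts = [[] for _ in range(num_partitions)]
    for point in sorted(points, key=nnz_cost, reverse=True):
        load, pid = heapq.heappop(heap)
        parts[pid].append(point)
        heapq.heappush(heap, (load + nnz_cost(point), pid))
    return parts


def makespan(parts):
    """Load of the heaviest partition, which bounds the job's running time."""
    return max(sum(nnz_cost(p) for p in part) for part in parts)


# Toy sparse vectors stored as {dimension: value}; dense and sparse points alternate.
sizes = [8, 1, 8, 1, 8, 1, 8, 1]
points = [{d: 1.0 for d in range(n)} for n in sizes]

skewed = makespan(round_robin_partition(points, 2))
balanced = makespan(balanced_partition(points, 2))
print(skewed, balanced)  # prints "32 18"
```

In this toy input, round-robin assignment sends every dense point to the same partition, so one task does eight times the work of the other. Balancing by non-zero count brings both partitions to the same load, which is the effect a skew-aware partitioner aims for.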