A Parallel Implementation of the K-Means Algorithm Based on MapReduce

2014 
With the explosion of data, data mining algorithms must handle huge numbers of records. In the traditional approach, processing runs in a single control flow, so computing time grows quickly as the data scale increases. K-means is one of the most widely used algorithms in cluster analysis. MapReduce is a programming model that has been widely used for processing data in a parallel environment. This paper presents an implementation of the K-means algorithm based on the MapReduce model, so that the clustering system can handle massive data in a fast and scalable fashion. A brief outline of the algorithm's structure and an analysis of the main improvement are also given. We demonstrate that the algorithm's advantage becomes more pronounced as the volume of data or the number of nodes in the computer cluster grows.
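
As an illustration of the general idea, one K-means iteration can be expressed as a map step that assigns each record to its nearest centroid and a reduce step that averages the records assigned to each centroid. The sketch below simulates this in plain Python on in-memory splits. It is not the paper's implementation: the function names, the squared Euclidean distance, and the local combine step are all assumptions made for the example, and no real MapReduce framework is involved.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the closest centroid (squared Euclidean distance, an assumption)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(split, centroids):
    """Map: emit (centroid index, (point, 1)) for every record in one split."""
    for point in split:
        yield nearest(point, centroids), (point, 1)


def combine(pairs):
    """Hypothetical combiner: pre-sum points per key within one split."""
    acc = {}
    for key, (vec, n) in pairs:
        if key in acc:
            total, count = acc[key]
            acc[key] = ([a + b for a, b in zip(total, vec)], count + n)
        else:
            acc[key] = (list(vec), n)
    return acc.items()


def reduce_phase(values):
    """Reduce: merge partial (sum, count) pairs into a new centroid."""
    total, count = None, 0
    for vec, n in values:
        total = list(vec) if total is None else [a + b for a, b in zip(total, vec)]
        count += n
    return [x / count for x in total]


def kmeans_iteration(splits, centroids):
    """Run one map/combine/reduce round and return the updated centroids."""
    grouped = defaultdict(list)  # stands in for the shuffle stage
    for split in splits:  # each split would be handled by a separate mapper
        for key, partial in combine(map_phase(split, centroids)):
            grouped[key].append(partial)
    updated = [list(c) for c in centroids]  # unassigned centroids stay put
    for key, values in grouped.items():
        updated[key] = reduce_phase(values)
    return updated


if __name__ == "__main__":
    splits = [[(0, 0), (0, 2)], [(10, 10), (10, 12)]]
    print(kmeans_iteration(splits, [(1, 1), (9, 9)]))
```

Because each split is mapped independently and only per-centroid partial sums cross the shuffle, the per-iteration work divides across nodes, which is the property that lets this style of implementation scale with data volume and cluster size. The driver would repeat `kmeans_iteration` until the centroids stop changing.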