Efficient Clustering Techniques on Hadoop and Spark

Giuseppe Di Fatta, Sami Al Ghamdi

2019

Clustering is an essential data mining technique that divides observations into groups such that each group contains similar observations. K-means is one of the most popular clustering algorithms and has been in use for over 50 years. With the current exponential growth of data, it has become necessary to further improve the efficiency and scalability of K-means to cope with large-scale datasets, known as big data. This paper presents K-means optimisations using the triangle inequality on two well-known distributed computing platforms: Hadoop and Spark. K-means variants that use the triangle inequality usually require caching extra information from the previous iteration, which is challenging to achieve on Hadoop. Hence, this work introduces two methods for passing information from one iteration to the next on Hadoop to accelerate K-means. The experimental work shows that the efficiency of K-means on Hadoop and Spark can be significantly improved by using triangle inequality optimisations.
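To illustrate the general idea behind triangle inequality optimisations, the following is a minimal single-machine sketch in Python. It is not the authors' Hadoop or Spark implementation, and it does not cache anything between iterations. It only shows the well-known pruning rule: for a point x whose current best centroid is c_a, if d(c_a, c_b) >= 2 d(x, c_a), then c_b cannot be closer to x than c_a, so d(x, c_b) need not be computed. The function names `assign_with_pruning` and `assign_brute_force` are hypothetical and chosen for this example only.

```python
import math


def dist(a, b):
    """Euclidean distance between two points given as equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centroids):
    """Assign each point to its nearest centroid, skipping distance
    computations that the triangle inequality proves unnecessary.

    Returns (labels, skipped), where skipped counts the avoided
    point-to-centroid distance computations.
    """
    k = len(centroids)
    # Centroid-to-centroid distances are computed once per assignment pass.
    cc = [[dist(centroids[i], centroids[j]) for j in range(k)] for i in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_d = dist(p, centroids[0])
        for j in range(1, k):
            # Triangle inequality: d(p, c_j) >= d(c_best, c_j) - d(p, c_best).
            # If d(c_best, c_j) >= 2 * d(p, c_best), then d(p, c_j) >= d(p, c_best),
            # so c_j cannot be strictly closer and its distance is not needed.
            if cc[best][j] >= 2 * best_d:
                skipped += 1
                continue
            d = dist(p, centroids[j])
            if d < best_d:
                best, best_d = j, d
        labels.append(best)
    return labels, skipped


def assign_brute_force(points, centroids):
    """Reference assignment that computes every point-to-centroid distance."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]
```

Calling `assign_with_pruning(points, centroids)` returns the same labels as the brute-force version while counting how many distance computations were avoided. Variants that additionally carry bounds from one iteration to the next would need the per-point state that the abstract describes as difficult to cache on Hadoop.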
