Improved MapReduce Load Balancing through Distribution-Dependent Hash Function Optimization

Zafar Ahmad,Sharmila Duppala,Rezaul Alam Chowdhury,Steven Skiena

Improved MapReduce Load Balancing through Distribution-Dependent Hash Function Optimization

2020

Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan for repetitive tasks involving a typical frequency distribution. Our idea is to analyze the observed frequency distribution for the given task so as to identify an optimal offset parameter $c$ to add in the hash function to minimize makespan. For two different bucketing methods - modulo labeling and consecutive binning - we present efficient algorithms for finding the optimal value of $c$ . Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but generally reduce makespan by 20% on average for power-law distributions, Results are confirmed with experiments on well-known real-world data sets.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations