Improved MapReduce Load Balancing through Distribution-Dependent Hash Function Optimization

2020 
Load balancing of skewed data in MapReduce systems like Hadoop is a well-studied problem. Many heuristics already exist to improve the load balance of the reducers thereby reducing the overall execution time. In this paper, we propose a lightweight optimization approach for MapReduce systems to minimize the makespan for repetitive tasks involving a typical frequency distribution. Our idea is to analyze the observed frequency distribution for the given task so as to identify an optimal offset parameter $c$ to add in the hash function to minimize makespan. For two different bucketing methods - modulo labeling and consecutive binning - we present efficient algorithms for finding the optimal value of $c$ . Finally, we present simulation results for both bucketing methods. The results vary with the data distribution and the number of reducers, but generally reduce makespan by 20% on average for power-law distributions, Results are confirmed with experiments on well-known real-world data sets.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    23
    References
    0
    Citations
    NaN
    KQI
    []