MRTune: A simulator for performance tuning of MapReduce jobs with skewed data

2014 
MapReduce is a programming model designed by Google that has been widely used for both high-performance computing and big data processing. Although the programming model is simple, performance tuning of a MapReduce job is very challenging because of the large number of configuration parameters and the tradeoffs between the performance gain of each optimization approach and the extra overhead it introduces. One naive way to address this issue is to run the MapReduce job repeatedly with different combinations of configuration parameters and optimization methods, then select the combination with the shortest running time. However, real execution is impractical because there may be too many combinations and a single run of each combination may take too long. Therefore, it is desirable to estimate the runtime of a job efficiently, without real execution, using only the input data and the configuration parameter settings of the cluster. In this paper, we propose a novel MapReduce simulator called MRTune for runtime estimation of MapReduce jobs. MRTune takes the key distribution of the input data into consideration and works well even when that distribution is skewed. Moreover, MRTune can estimate the runtime of a job in the presence of unpredictable task failures. We evaluate MRTune on MapReduce jobs whose input data follow a Zipfian key distribution. The results show that MRTune estimates the runtime of MapReduce jobs with high accuracy and efficiency even when the key distribution of the input data is skewed. We also conduct two case studies to analyse the impact of data skew and task failures on a MapReduce job.
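To illustrate why key skew matters for runtime estimation, the following Python sketch computes the expected number of records each reduce task receives when keys follow a Zipfian distribution. It is not MRTune's model. The function names, the modulo partitioner standing in for hash partitioning, and all parameter values are assumptions made for illustration only.

```python
def zipf_weights(num_keys, exponent):
    """Return normalized Zipfian probabilities for key ranks 1..num_keys."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, num_keys + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def reducer_loads(num_records, num_keys, exponent, num_reducers):
    """Expected records per reducer under a modulo partitioner (assumed)."""
    weights = zipf_weights(num_keys, exponent)
    loads = [0.0] * num_reducers
    for key_index, weight in enumerate(weights):
        # All records sharing a key go to the same reducer.
        loads[key_index % num_reducers] += weight * num_records
    return loads


def skew_ratio(loads):
    """Heaviest reducer load divided by the mean load (1.0 means balanced)."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean


if __name__ == "__main__":
    # Hypothetical job: 1,000,000 records, 1,000 distinct keys, 10 reducers.
    for exponent in (0.0, 0.5, 1.0):
        loads = reducer_loads(1_000_000, 1_000, exponent, 10)
        print(f"exponent={exponent}: skew ratio={skew_ratio(loads):.2f}")
```

Because the reduce phase finishes only when its slowest task does, a skew ratio well above 1.0 suggests that an estimate based on the average load would understate the job's runtime.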