RUPredHadoop: Resources Utilization Predictor for Hadoop with Large-Scale Clusters

2018 
Apache Hadoop is a widely used distributed system in large-scale production environments. As data volume and cluster scale grow, its performance is limited by inappropriate resource utilization. This paper introduces a resource utilization predictor (RUPredHadoop) that predicts the utilization of CPU, memory, and the read/write rates of disk and network, especially for large-scale Hadoop clusters. Because data and workflow are similar across tasks in Hadoop, a pattern of resource utilization for a single task is proposed and then formalized as a single-task model. In addition, the distribution of fine-grained runtime is studied, so that a parallel-batch-tasks-based model can regenerate the whole MapReduce job by migrating the single-task model from a minimal cluster to a large-scale production cluster. With RUPredHadoop, we can locate the resource bottleneck of a Hadoop cluster and quickly configure clusters for applications with massive data. The performance of RUPredHadoop is validated on a test cluster with 35 nodes and a production cluster with 80 nodes. Results show that the normalization error is below 10% for benchmark applications with up to 100 TB of data.
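The idea of regenerating a whole job from a single-task model can be illustrated with a minimal sketch. This is not the paper's actual model: it assumes every task follows the same per-second utilization profile, that all tasks have the same duration, and that tasks run in strict waves bounded by the number of parallel slots. The names `aggregate_utilization`, `task_profile`, `num_tasks`, and `slots` are hypothetical and chosen for illustration only.

```python
from typing import List


def aggregate_utilization(task_profile: List[float],
                          num_tasks: int,
                          slots: int) -> List[float]:
    """Superimpose one task's utilization profile over waves of parallel tasks.

    Assumptions (not from the paper): identical task profiles, identical
    task durations, and each wave starts only after the previous one ends.
    """
    if slots <= 0 or num_tasks < 0:
        raise ValueError("slots must be positive and num_tasks non-negative")

    duration = len(task_profile)
    waves = -(-num_tasks // slots)  # ceiling division
    timeline = [0.0] * (waves * duration)

    remaining = num_tasks
    for wave in range(waves):
        running = min(slots, remaining)  # tasks active in this wave
        offset = wave * duration
        for t, usage in enumerate(task_profile):
            timeline[offset + t] += running * usage
        remaining -= running
    return timeline


if __name__ == "__main__":
    # Hypothetical single-task profile (arbitrary units per second).
    profile = [1.0, 2.0, 1.0]
    job = aggregate_utilization(profile, num_tasks=5, slots=2)
    print(job)
    print("peak demand:", max(job))
```

Comparing the peak of such a predicted timeline against the capacity of each resource is one simple way to see which resource would saturate first. A model like the one the abstract describes would also need to account for the studied distribution of task runtimes, which this sketch deliberately ignores.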