OPS: Optimized Shuffle Management System for Apache Spark

2020 
In recent years, distributed computing frameworks, such as Hadoop MapReduce and Spark, are widely used for big data processing. With the explosive growth of the amount of data, companies tend to store intermediate data of the shuffle phase on disk instead of memory. Therefore, intensive network and disk I/O are both involved in the shuffle phase. To optimize the overhead of the shuffle phase, we propose OPS, an open-source distributed computing shuffle management system based on Spark, which provides an independent shuffle service for Spark. By using early-merge and early-shuffle strategy, OPS alleviates the I/O overhead in the shuffle phase and efficiently schedules the I/O and computing resources. OPS also proposes a slot-based scheduling algorithm to predict and calculate the optimal scheduling result of the reduce task. Besides, OPS provides a taint-redo strategy to ensure the fault tolerance of computing jobs. We evaluated the performance of OPS on a 100-node Amazon AWS EC2 cluster. Overall, OPS optimizes the overhead of shuffle by nearly 50%. In the test cases of HiBench, OPS improves end-to-end completion time by nearly 30% on average.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    22
    References
    0
    Citations
    NaN
    KQI
    []