SpongeFiles: mitigating data skew in mapreduce using distributed memory

2014 
Data skew is a major problem for data processing platforms like MapReduce. Skew causes worker tasks to spill to disk what they cannot fit in memory, which slows down the task and the overall job. Moreover, performance of other jobs sharing same disk degrades. In many cases, this situation occurs even as the cluster has plenty of spare memory it is just not used evenly. We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. A SpongeFile is a logical byte array, comprised of large chunks that can be stored in a variety of locations in the cluster. Spilled data goes to SpongeFiles, which route it to the nearest location with sufficient capacity (local memory, remote memory, local disk, or remote disk as a last resort). By enabling memory-sapped nodes to tap into the spare capacity of their neighbors, SpongeFiles minimize expensive disk spilling, thereby improving performance. In our experiments with Hadoop and Pig, SpongeFiles reduce overall job runtimes by up to 55% and by up to 85% under disk contention.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    27
    References
    22
    Citations
    NaN
    KQI
    []