Straggler-Proofing Massive-Scale Distributed Matrix Multiplication with D-Dimensional Product Codes

2018 
Distributed computing allows for large-scale computation and machine learning tasks by enabling parallel computing at massive scale. A critical challenge to speeding up distributed computing comes from stragglers, a crippling bottleneck to system performance [1]. Recently, coding theory has offered an attractive paradigm dubbed as coded computation [2] for addressing this challenge through the judicious introduction of redundant computing to combat stragglers. However, most existing approaches have limited applicability if the system scales to hundreds or thousands of workers, as is the trend in computing platforms. At these scales, previously proposed algorithms based on Maximum Distance Separable (MDS) codes are too expensive due to their hidden cost, i.e., computing and communication costs associated with the encoding/decoding procedures. Motivated by this limitation, we present a novel coded matrix-matrix multiplication scheme based on d-dimensional product codes. We show that our scheme allows for order-optimal computation/communication costs for the encoding/decoding procedures while achieving near-optimal compute time.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    14
    References
    62
    Citations
    NaN
    KQI
    []