Efficient and Robust Distributed Matrix Computations via Convolutional Coding

2021 
Distributed matrix computations are well recognized to suffer from the problem of stragglers (slow or failed worker nodes). The majority of prior work in this area has presented straggler mitigation strategies that either (i) are sub-optimal in terms of their straggler resilience, or (ii) suffer from numerical problems, i.e., a blow-up of round-off errors in the decoded result owing to the high condition numbers of the corresponding decoding matrices. This work introduces a novel solution framework, based on embedding the computations into the structure of a convolutional code, that removes these limitations. Our approach is provably optimal in terms of its straggler resilience, and it has excellent numerical robustness, which can be theoretically quantified by deriving a computable upper bound on the worst-case condition number over all possible decoding matrices. All of the above claims are backed by extensive experiments on the AWS cloud platform.
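
The following is a minimal, generic sketch of the coded-computation setting the abstract refers to: a master splits a matrix into blocks, encodes them, and recovers the product from any k of n worker responses, with the condition number of the decoding matrix governing round-off amplification. The dimensions, the random Gaussian generator matrix, and the straggler pattern are illustrative assumptions only; this is not the paper's convolutional-code construction.

```python
# Illustrative sketch of coded distributed matrix-vector multiplication.
# Uses a random Gaussian generator matrix (an assumption for this sketch),
# NOT the convolutional-code scheme proposed in the paper.
import numpy as np

rng = np.random.default_rng(0)

m, d = 600, 400          # dimensions of A (assumed for illustration)
k, n = 4, 6              # k data blocks, n workers -> tolerates n - k stragglers

A = rng.standard_normal((m, d))
x = rng.standard_normal(d)

# Split A into k row-blocks and encode them into n coded blocks.
blocks = np.split(A, k, axis=0)                   # each block is (m/k) x d
G = rng.standard_normal((n, k))                   # generator matrix (assumption)
coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

# Each worker i would compute coded[i] @ x; here we simulate that locally
# and pretend the last (n - k) workers are stragglers.
worker_results = {i: coded[i] @ x for i in range(n)}
survivors = sorted(worker_results)[:k]            # any k responses suffice

# Decode: solve the k x k system given by the surviving rows of G.
G_s = G[survivors, :]                             # decoding matrix
Y = np.stack([worker_results[i] for i in survivors])   # k x (m/k)
decoded_blocks = np.linalg.solve(G_s, Y)          # recovers blocks[j] @ x
Ax_hat = decoded_blocks.reshape(-1)

print("max error vs. A @ x:", np.max(np.abs(Ax_hat - A @ x)))
print("condition number of decoding matrix:", np.linalg.cond(G_s))
```

In this sketch, a poorly conditioned decoding matrix G_s amplifies round-off error in the recovered product; the paper's stated contribution is a construction whose worst-case condition number over all possible decoding matrices admits a computable upper bound.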