Rethinking Transport Layer Design for Distributed Machine Learning

2019 
Driven by the increasing scale of data, the need for high-performance distributed machine learning systems keeps growing, and many research efforts have been proposed to improve their performance. In this paper, we call upon the community to rethink transport-layer solutions for distributed machine learning, given its stringent network requirements and special algorithmic properties. Distributed machine learning jobs generate bursty traffic when synchronizing parameters, and a single long-tail flow can significantly slow down the entire training process. Meanwhile, in contrast to other distributed applications, we find that machine learning algorithms are bounded-loss tolerant: randomized network data losses below a certain fraction (typically 10%-35%) do little harm to end-to-end job performance. Motivated by this observation, we highlight new opportunities to design bounded-loss tolerant transports that optimize the performance of distributed machine learning. By intentionally ignoring some packet losses, such a transport avoids unnecessary retransmissions and thus reduces tail flow completion time. Following this principle, our preliminary results show that a simplified protocol yields a 1.1-2.2x speedup on different distributed machine learning tasks.
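
To make the bounded-loss tolerance idea concrete, the following is a minimal sketch (not the authors' protocol) of how a receiver might aggregate gradient chunks sent over an unreliable channel: chunks that arrive are used as-is, chunks lost in transit are simply zero-filled rather than retransmitted, as long as the observed loss fraction stays under a configurable budget. The chunk size, the 20% loss budget, and the zero-fill policy are illustrative assumptions, not values taken from the paper.

```python
# Sketch of bounded-loss tolerant reassembly: ignore lost chunks instead of
# retransmitting them, provided the loss fraction stays within a budget.
# All constants below are assumptions for illustration only.
import numpy as np

CHUNK_SIZE = 1024          # parameters per network chunk (assumed)
LOSS_BUDGET = 0.20         # tolerate up to 20% randomized chunk loss (assumed)

def to_chunks(grad):
    """Split a flat gradient vector into fixed-size chunks (last one zero-padded)."""
    padded = np.zeros(int(np.ceil(grad.size / CHUNK_SIZE)) * CHUNK_SIZE)
    padded[:grad.size] = grad
    return padded.reshape(-1, CHUNK_SIZE)

def lossy_channel(chunks, loss_rate, rng):
    """Model an unreliable transport: each chunk is independently dropped."""
    return {i: c for i, c in enumerate(chunks) if rng.random() > loss_rate}

def bounded_loss_receive(received, num_chunks, orig_size):
    """Reassemble the gradient without retransmission if loss is within budget."""
    loss_frac = 1.0 - len(received) / num_chunks
    if loss_frac > LOSS_BUDGET:
        raise RuntimeError(
            f"loss {loss_frac:.0%} exceeds budget; fall back to a reliable path")
    out = np.zeros((num_chunks, CHUNK_SIZE))
    for i, chunk in received.items():   # missing chunks stay zero (ignored loss)
        out[i] = chunk
    return out.reshape(-1)[:orig_size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(10_000)
    chunks = to_chunks(grad)
    received = lossy_channel(chunks, loss_rate=0.10, rng=rng)
    approx = bounded_loss_receive(received, len(chunks), grad.size)
    print("chunks delivered:", len(received), "/", len(chunks))
    print("relative error from ignored losses:",
          np.linalg.norm(approx - grad) / np.linalg.norm(grad))
```

Because no lost chunk is ever retransmitted, the receiver never waits on a straggling packet, which is exactly the tail-latency saving the abstract describes; the cost is a bounded perturbation of the synchronized parameters that the training algorithm is assumed to tolerate.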