Rethinking Transport Layer Design for Distributed Machine Learning

2019 
Driven by the increasing scale of data, the need for high-performance distributed machine learning systems keeps growing, and many research efforts have been proposed to improve their performance. In this paper, we call upon the community to rethink transport-layer solutions for distributed machine learning, given its stringent network requirements and special algorithmic properties. Distributed machine learning jobs generate bursty traffic when synchronizing parameters, and a single long-tail flow can significantly slow down the entire training process. Meanwhile, in contrast to other distributed applications, we find that machine learning algorithms are bounded-loss tolerant: randomized network data losses below a certain fraction (typically 10%-35%) do little harm to end-to-end job performance. Motivated by this observation, we highlight new opportunities to design bounded-loss tolerant transports that optimize the performance of distributed machine learning. By intentionally ignoring some packet losses, such a transport avoids unnecessary retransmissions and thus reduces tail flow completion time. Following this principle, our preliminary results show that a simplified protocol yields a 1.1-2.2x speedup on different distributed machine learning tasks.
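
To make the bounded-loss tolerance idea concrete, the following is a minimal sketch (not the authors' protocol) of how a receiver might aggregate gradient chunks sent over an unreliable channel: chunks that arrive are used as-is, chunks lost in transit are simply zero-filled rather than retransmitted, as long as the observed loss fraction stays under a configurable budget. The chunk size, the 20% loss budget, and the zero-fill policy are illustrative assumptions, not values taken from the paper.

```python
# Sketch of bounded-loss tolerant reassembly: ignore lost chunks instead of
# retransmitting them, provided the loss fraction stays within a budget.
# All constants below are assumptions for illustration only.
import numpy as np

CHUNK_SIZE = 1024          # parameters per network chunk (assumed)
LOSS_BUDGET = 0.20         # tolerate up to 20% randomized chunk loss (assumed)

def to_chunks(grad):
    """Split a flat gradient vector into fixed-size chunks (last one zero-padded)."""
    padded = np.zeros(int(np.ceil(grad.size / CHUNK_SIZE)) * CHUNK_SIZE)
    padded[:grad.size] = grad
    return padded.reshape(-1, CHUNK_SIZE)

def lossy_channel(chunks, loss_rate, rng):
    """Model an unreliable transport: each chunk is independently dropped."""
    return {i: c for i, c in enumerate(chunks) if rng.random() > loss_rate}

def bounded_loss_receive(received, num_chunks, orig_size):
    """Reassemble the gradient without retransmission if loss is within budget."""
    loss_frac = 1.0 - len(received) / num_chunks
    if loss_frac > LOSS_BUDGET:
        raise RuntimeError(
            f"loss {loss_frac:.0%} exceeds budget; fall back to a reliable path")
    out = np.zeros((num_chunks, CHUNK_SIZE))
    for i, chunk in received.items():   # missing chunks stay zero (ignored loss)
        out[i] = chunk
    return out.reshape(-1)[:orig_size]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(10_000)
    chunks = to_chunks(grad)
    received = lossy_channel(chunks, loss_rate=0.10, rng=rng)
    approx = bounded_loss_receive(received, len(chunks), grad.size)
    print("chunks delivered:", len(received), "/", len(chunks))
    print("relative error from ignored losses:",
          np.linalg.norm(approx - grad) / np.linalg.norm(grad))
```

Because no lost chunk is ever retransmitted, the receiver never waits on a straggling packet, which is exactly the tail-latency saving the abstract describes; the cost is a bounded perturbation of the synchronized parameters that the training algorithm is assumed to tolerate.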