Herring: rethinking the parameter server at scale for the cloud

2020 
Training large deep neural networks is time-consuming and may take days or even weeks to complete. Although parameter-server-based approaches were initially popular in distributed training, scalability issues led the field to move towards all-reduce-based approaches. However, recent developments in cloud networking technologies, such as the Elastic Fabric Adapter (EFA) and Scalable Reliable Datagram (SRD), motivate rethinking the parameter-server approach to address its fundamental inefficiencies. To this end, we introduce a novel communication library, Herring, designed to alleviate the performance bottlenecks of parameter-server-based training. We show that gradient reduction with Herring is twice as fast as all-reduce-based methods. We further demonstrate that training deep learning models such as BERT-Large with Herring outperforms all-reduce-based training, achieving 85% scaling efficiency on large clusters with up to 2048 NVIDIA V100 GPUs without any drop in accuracy.
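To make the contrast the abstract draws concrete, below is a minimal single-process sketch of the two gradient-reduction patterns being compared: a central parameter server that aggregates pushed gradients and broadcasts the sum back, versus an all-reduce in which workers cooperatively compute the same global sum. The function names and shapes are illustrative assumptions, not Herring's actual API.

```python
# Hedged illustration only: a toy, single-process simulation of the two
# gradient-reduction patterns compared in the abstract. The names below
# (parameter_server_reduce, all_reduce) are hypothetical, not Herring's API.
import numpy as np

NUM_WORKERS = 4
GRAD_SIZE = 8

# Each "worker" produces a local gradient for the same parameter tensor.
local_grads = [np.random.randn(GRAD_SIZE) for _ in range(NUM_WORKERS)]

def parameter_server_reduce(grads):
    """Parameter-server pattern: workers push gradients to a central server,
    which aggregates them and broadcasts the result back to every worker."""
    server_total = np.sum(grads, axis=0)          # aggregation on the server
    return [server_total.copy() for _ in grads]   # broadcast back to workers

def all_reduce(grads):
    """All-reduce pattern: workers cooperatively reduce the gradients (e.g.
    via a ring or tree) so each ends up holding the same global sum."""
    global_sum = np.sum(grads, axis=0)
    return [global_sum.copy() for _ in grads]

ps_result = parameter_server_reduce(local_grads)
ar_result = all_reduce(local_grads)
assert all(np.allclose(a, b) for a, b in zip(ps_result, ar_result))
print("Both schemes leave every worker with the same aggregated gradient.")
```

Both patterns compute the same result; the paper's argument is about how efficiently that reduction can be carried out over cloud networking hardware such as EFA and SRD.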