Alleviating Load Imbalance in Data Processing for Large-Scale Deep Learning

2020 
Scalable deep learning remains an onerous challenge, constrained by many factors, including load imbalance. In many deep-learning software systems, multiple data-processing components, including neural network training, graph scheduling, the input pipeline, and gradient synchronization, execute simultaneously and asynchronously. Such execution can cause these components to contend with one another for hardware resources, leading to severe load imbalance and, in turn, degraded scalability. In this paper, we present an in-depth analysis of the state-of-the-art deep-learning software TensorFlow and Horovod to understand their scalability limitations. Based on this analysis, we propose four novel solutions that minimize resource contention and improve deep-learning performance by up to 35% when training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory.
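To make the contention problem concrete, the sketch below shows one generic way to bound the host-side resources used by the input pipeline and by TensorFlow's own thread pools, so that data loading does not compete with graph scheduling and gradient synchronization for CPU cores. This is only an illustrative example assuming the TensorFlow 2.x API (tf.config.threading and tf.data.Options); it is not the four solutions proposed in the paper, and the thread counts and the helper make_dataset are placeholders.

```python
import tensorflow as tf

# Cap TensorFlow's inter-op and intra-op thread pools so that graph
# scheduling does not oversubscribe the host cores it shares with the
# input pipeline and the gradient-synchronization background threads.
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.config.threading.set_intra_op_parallelism_threads(8)

def make_dataset(file_pattern, batch_size):
    """Build an input pipeline whose host-side parallelism is bounded
    explicitly instead of letting autotuning grab every available core."""
    ds = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = ds.interleave(tf.data.TFRecordDataset,
                       cycle_length=4, num_parallel_calls=4)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(2)

    options = tf.data.Options()
    # Give the input pipeline a private, fixed-size thread pool so it
    # cannot steal cores from training and communication.
    options.threading.private_threadpool_size = 8
    return ds.with_options(options)
```

In practice, the right thread counts depend on how many cores each GPU rank owns on the node; the point of the example is only that fixing the split explicitly removes one common source of the contention the abstract describes.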