Alleviating Load Imbalance in Data Processing for Large-Scale Deep Learning

2020 
Scalable deep learning remains an onerous challenge, constrained by many factors, including load imbalance. In many deep-learning software systems, multiple data-processing components, including neural network training, graph scheduling, the input pipeline, and gradient synchronization, execute simultaneously and asynchronously. Such execution can cause these components to contend with one another for hardware resources, leading to severe load imbalance and, in turn, degraded scalability. In this paper, we present an in-depth analysis of the state-of-the-art deep-learning software TensorFlow and Horovod to understand their scalability limitations. Based on this analysis, we propose four novel solutions that minimize resource contention and improve deep-learning performance by up to 35% when training various neural networks on 24,576 GPUs of the Summit supercomputer at Oak Ridge National Laboratory.
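To make the contention problem concrete, the sketch below shows one generic way to bound the host-side resources used by the input pipeline and by TensorFlow's own thread pools, so that data loading does not compete with graph scheduling and gradient synchronization for CPU cores. This is only an illustrative example assuming the TensorFlow 2.x API (tf.config.threading and tf.data.Options); it is not the four solutions proposed in the paper, and the thread counts and the helper make_dataset are placeholders.

```python
import tensorflow as tf

# Cap TensorFlow's inter-op and intra-op thread pools so that graph
# scheduling does not oversubscribe the host cores it shares with the
# input pipeline and the gradient-synchronization background threads.
tf.config.threading.set_inter_op_parallelism_threads(4)
tf.config.threading.set_intra_op_parallelism_threads(8)

def make_dataset(file_pattern, batch_size):
    """Build an input pipeline whose host-side parallelism is bounded
    explicitly instead of letting autotuning grab every available core."""
    ds = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = ds.interleave(tf.data.TFRecordDataset,
                       cycle_length=4, num_parallel_calls=4)
    ds = ds.batch(batch_size)
    ds = ds.prefetch(2)

    options = tf.data.Options()
    # Give the input pipeline a private, fixed-size thread pool so it
    # cannot steal cores from training and communication.
    options.threading.private_threadpool_size = 8
    return ds.with_options(options)
```

In practice, the right thread counts depend on how many cores each GPU rank owns on the node; the point of the example is only that fixing the split explicitly removes one common source of the contention the abstract describes.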