An Oracle for Guiding Large-Scale Model/Hybrid Parallel Training of Convolutional Neural Networks
2021
Deep Neural Network (DNN) frameworks use distributed training to enable
faster time to convergence and alleviate memory capacity limitations when
training large models and/or using high-dimensional inputs. With the steady
increase in dataset and model sizes, model/hybrid parallelism is expected to
play an important role in the future of distributed training of DNNs. We
analyze the compute, communication, and memory requirements of Convolutional
Neural Networks (CNNs) to understand the trade-offs between different
parallelism approaches in terms of performance and scalability. We use our
model-driven analysis as the basis for an oracle utility that helps detect
the limitations and bottlenecks of different parallelism approaches
at scale. We evaluate the oracle on six parallelization strategies, with four
CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results
demonstrate that the oracle has an average accuracy of about 86.74% when
compared to empirical results, and as high as 97.57% for data parallelism.
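
To make the idea of a model-driven prediction concrete, the following is a minimal sketch of an analytical estimate for data parallelism only. It is not the paper's oracle. It assumes the standard latency-bandwidth (alpha-beta) cost of a ring allreduce, no overlap between computation and communication, and a relative-error definition of accuracy. All function names, parameters, and numbers are hypothetical.

```python
def ring_allreduce_time(num_gpus, message_bytes, bandwidth_bytes_per_s, latency_s):
    """Alpha-beta cost of a ring allreduce over num_gpus workers.

    Assumption: 2*(p-1) steps, each moving message_bytes/p bytes.
    """
    if num_gpus <= 1:
        return 0.0  # a single worker has nothing to exchange
    steps = 2 * (num_gpus - 1)
    per_step_bytes = message_bytes / num_gpus
    return steps * (latency_s + per_step_bytes / bandwidth_bytes_per_s)


def data_parallel_iteration_time(compute_s, num_gpus, gradient_bytes,
                                 bandwidth_bytes_per_s, latency_s):
    """Predicted time of one training iteration under data parallelism.

    Assumption: gradient exchange is not overlapped with computation,
    so the two costs simply add.
    """
    comm_s = ring_allreduce_time(num_gpus, gradient_bytes,
                                 bandwidth_bytes_per_s, latency_s)
    return compute_s + comm_s


def prediction_accuracy(predicted_s, measured_s):
    """Accuracy as 1 - relative error (an assumed definition)."""
    return 1.0 - abs(predicted_s - measured_s) / measured_s


if __name__ == "__main__":
    # Hypothetical inputs: 100 MB of gradients, 10 GB/s links, 5 us latency.
    predicted = data_parallel_iteration_time(
        compute_s=0.050, num_gpus=64, gradient_bytes=100e6,
        bandwidth_bytes_per_s=10e9, latency_s=5e-6)
    print(f"predicted iteration time: {predicted:.4f} s")
```

Comparing such a prediction against a measured iteration time with `prediction_accuracy` gives a single number per configuration, which is one plausible way to read the accuracy figures quoted above.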