Communication Profiling and Characterization of Deep Learning Workloads on Clusters with High-Performance Interconnects

2019 
Heterogeneous high-performance computing systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed deep learning (DL). In this article, we choose Horovod, a distributed training middleware, to analyze and profile various DNN training workloads using TensorFlow and PyTorch, in addition to standard MPI microbenchmarks. We use a wide variety of systems with CPUs like Intel Xeon and IBM POWER9, GPUs like NVIDIA Volta V100, and various interconnects to analyze the following metrics: 1) message size with Horovod's tensor fusion; 2) message size without tensor fusion; 3) number of MPI/NCCL calls; and 4) time taken by each MPI/NCCL call. We observed extreme performance variations for non-power-of-two message sizes on different platforms. To address this, we design a message-padding scheme for Horovod, demonstrate significantly smoother allreduce latency profiles, and report cases where we observed improvement for end-to-end training.
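
The abstract does not detail how the message-padding scheme is implemented inside Horovod. As a rough, user-level illustration only, the sketch below shows the general idea of zero-padding a non-power-of-two gradient buffer up to the next power-of-two length before an allreduce and trimming the result afterwards. It uses mpi4py and NumPy; the function names and the user-level placement are assumptions for illustration, not the paper's actual implementation, which operates within Horovod's fusion buffer.

```python
# Minimal sketch of the padding idea, assuming mpi4py and NumPy are available.
# Names such as padded_allreduce are illustrative only; the paper's scheme is
# implemented inside Horovod, not at the application level as shown here.
import numpy as np
from mpi4py import MPI


def next_power_of_two(n: int) -> int:
    """Smallest power of two greater than or equal to n."""
    return 1 if n <= 1 else 2 ** (n - 1).bit_length()


def padded_allreduce(comm: MPI.Comm, tensor: np.ndarray) -> np.ndarray:
    """Allreduce a tensor after zero-padding it to a power-of-two element count."""
    flat = np.ascontiguousarray(tensor.ravel())
    padded_len = next_power_of_two(flat.size)
    send = np.zeros(padded_len, dtype=flat.dtype)
    send[: flat.size] = flat                        # zero padding is neutral for SUM
    recv = np.empty_like(send)
    comm.Allreduce(send, recv, op=MPI.SUM)          # power-of-two message size
    return recv[: flat.size].reshape(tensor.shape)  # discard the padding

if __name__ == "__main__":
    comm = MPI.COMM_WORLD
    # Hypothetical gradient with a non-power-of-two element count.
    grad = np.full(1000, comm.Get_rank(), dtype=np.float32)
    reduced = padded_allreduce(comm, grad)
    if comm.Get_rank() == 0:
        print(reduced[:5])
```

Run with, e.g., `mpirun -np 4 python padded_allreduce.py`. The extra bandwidth cost of the padding is bounded (at most 2x the payload), which is the trade-off the paper weighs against the smoother allreduce latency profile.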