HAL: Computer System for Scalable Deep Learning

Volodymyr Kindratenko, Dawei Mu, Yan Zhan, John Maloney, Sayed Hadi Hashemi, Benjamin Rabe, Ke Xu, Roy H. Campbell, Jian Peng, William Gropp

2020

We describe the design, deployment, and operation of a computer system built to run deep learning frameworks efficiently. The system consists of 16 IBM POWER9 servers, each with 4 NVIDIA V100 GPUs, interconnected with a Mellanox EDR InfiniBand fabric, and a DDN all-flash storage array. The system is tailored to efficient execution of the IBM Watson Machine Learning enterprise software stack, which combines popular open-source deep learning frameworks. We build a custom management software stack to enable efficient use of the system by a diverse community of users, and we provide guides and recipes for running deep learning workloads at scale using all available GPUs. We demonstrate scaling of PyTorch- and TensorFlow-based deep neural networks to produce state-of-the-art performance results.

Keywords:

Artificial intelligence
Deep learning
Computer architecture
Scalability
Computer science
