Extended task queuing: active messages for heterogeneous systems

Michael LeBeane, Brandon Potter, Abhisek Pan, Alexandru Dutu, Vinay Agarwala, Wonchan Lee, Deepak Majeti, Bibek Ghimire, Eric Van Tassell, Samuel Wasmundt, Brad Benton, Mauricio Breternitz, Michael L. Chu, Mithuna Thottethodi, Lizy Kurian John, Steven K. Reinhardt

2016

Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to amortize the transfer and launch overhead. By combining advances in intra-node accelerator unification with one-sided Remote Direct Memory Access (RDMA) communication primitives, it is possible to efficiently implement lightweight tasking across distributed-memory systems. This paper introduces Extended Task Queuing (XTQ), an RDMA-based active messaging mechanism for accelerators in distributed-memory systems. XTQ's direct NIC-to-accelerator communication decreases inter-node GPU task launch latency by 10-15% for small- to medium-sized messages and reduces CPU message servicing overheads. These benefits are shown in the context of MPI accumulate, reduce, and allreduce operations with up to 64 nodes. Finally, we illustrate how XTQ can improve the performance of popular deep learning workloads implemented in the Computational Network Toolkit (CNTK).
