Extended task queuing: active messages for heterogeneous systems

2016 
Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to amortize the transfer and launch overhead. By combining advances in intra-node accelerator unification with one-sided Remote Direct Memory Access (RDMA) communication primitives, it is possible to efficiently implement lightweight tasking across distributed-memory systems. This paper introduces Extended Task Queuing (XTQ), an RDMA-based active messaging mechanism for accelerators in distributed-memory systems. XTQ's direct NIC-to-accelerator communication decreases inter-node GPU task launch latency by 10-15% for small-to-medium-sized messages and reduces CPU message-servicing overhead. These benefits are shown in the context of MPI accumulate, reduce, and allreduce operations on up to 64 nodes. Finally, we illustrate how XTQ can improve the performance of popular deep learning workloads implemented in the Computational Network Toolkit (CNTK).
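As a rough illustration of the active-messaging idea the abstract refers to, the sketch below simulates, in a single process, a target node that receives messages naming a handler and carrying a payload, queues them, and runs the handler against a local buffer (here an accumulate-style sum). This is a conceptual sketch only, not XTQ's interface: the names `ActiveMessage`, `TargetNode`, `accumulate_sum`, and `ACC_SUM` are hypothetical, and no RDMA, NIC, or GPU is involved.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class ActiveMessage:
    """Hypothetical message layout: which handler to run, where, and on what data."""
    handler_id: int
    offset: int
    payload: List[float]


class TargetNode:
    """Stand-in for a remote node: a local buffer, a task queue, a handler table."""

    def __init__(self, size: int) -> None:
        self.buffer: List[float] = [0.0] * size
        self.queue: Deque[ActiveMessage] = deque()
        self.handlers: Dict[int, Callable[["TargetNode", ActiveMessage], None]] = {}

    def register(self, handler_id: int,
                 fn: Callable[["TargetNode", ActiveMessage], None]) -> None:
        self.handlers[handler_id] = fn

    def deliver(self, msg: ActiveMessage) -> None:
        # Models the sender placing a task in the target's queue; in a real
        # system this would be a one-sided network write, not a method call.
        self.queue.append(msg)

    def drain(self) -> int:
        # Models the consumer of the queue running each task's handler in order.
        ran = 0
        while self.queue:
            msg = self.queue.popleft()
            self.handlers[msg.handler_id](self, msg)
            ran += 1
        return ran


def accumulate_sum(node: TargetNode, msg: ActiveMessage) -> None:
    """Element-wise sum of the payload into the target buffer at msg.offset."""
    for i, value in enumerate(msg.payload):
        node.buffer[msg.offset + i] += value


ACC_SUM = 1  # hypothetical handler identifier


def demo() -> List[float]:
    node = TargetNode(4)
    node.register(ACC_SUM, accumulate_sum)
    node.deliver(ActiveMessage(ACC_SUM, 0, [1.0, 2.0]))
    node.deliver(ActiveMessage(ACC_SUM, 1, [10.0, 20.0, 30.0]))
    node.drain()
    return node.buffer
```

Calling `demo()` delivers two overlapping accumulate messages and returns the resulting buffer. The point of the pattern is that the sender only enqueues work; the computation runs where the data lives.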