Scaling collectives on large clusters using Intel(R) architecture processors and fabric

Masashi Horikoshi, Larry Meadows, Tom Elken, Pradeep Sivakumar, Edward Mascarenhas, James Erwin, Dmitry Durnov, Alexander Sannikov, Toshihiro Hanawa, Taisuke Boku

2018

This paper presents results on scaling Barrier and Allreduce to 8192 nodes on a cluster of Intel® Xeon Phi™ processors installed at the University of Tokyo and the University of Tsukuba. We describe the effects of OS and platform noise on the performance of these collectives, and show how to minimize that noise and isolate it to specific cores. Our results show that Barrier and Allreduce scale well when noise is reduced. We achieved a latency of 94 usec (7.1x speedup from baseline) for 1 rank per node Barrier and 145 usec (3.3x speedup) for Allreduce at the 16 byte (16B) message size at 4096 nodes.
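To put the reported speedups in context, the short sketch below backs out the baseline latencies they imply. It assumes that speedup is defined as baseline latency divided by optimized latency. The abstract does not state that definition or the baseline values, so the results are only rough estimates derived from rounded figures. The function name `implied_baseline_usec` is hypothetical and used only for illustration.

```python
def implied_baseline_usec(optimized_usec: float, speedup: float) -> float:
    """Estimate a baseline latency, ASSUMING speedup = baseline / optimized."""
    return optimized_usec * speedup


# (optimized latency in usec, speedup) as reported in the abstract.
reported = {
    "Barrier (1 rank per node)": (94.0, 7.1),
    "Allreduce (16B)": (145.0, 3.3),
}

for name, (latency_usec, speedup) in reported.items():
    baseline = implied_baseline_usec(latency_usec, speedup)
    print(f"{name}: implied baseline of about {baseline:.0f} usec")
```

Under that assumption, the implied baselines come out to roughly 670 usec for Barrier and roughly 480 usec for Allreduce.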

Keywords:

Parallel computing
Xeon
Speedup
Byte
Network performance
Architecture
Cluster (physics)
Scaling
Latency (engineering)
Computer science

