Overlapping MPI communications with Intel TBB computation
2020
Scientific computing demands ever-increasing performance, up to the Exascale and beyond. To meet this need at the hardware level, the current trend is toward architectures with high core counts and complex memory hierarchies. Task-based programming is a popular model for exploiting such architectures efficiently. Among task-based models, recursive task graphs are known for useful properties, such as knowledge of each task's predecessors, which can be exploited to reach nearly optimal scheduling strategies. In this paper, we tackle the communication/computation overlap problem that arises when coupling this model with inter-process communication performed through MPI. Because these task graphs are recursive, choosing when and how to progress communications is crucial to achieving good overlap without severely impacting computation. We propose three methods to improve the overlap of communication with computation by inserting dedicated progress tasks at well-chosen spots of the recursive task graph. With an implementation of these methods in the Intel Threading Building Blocks runtime, we show up to 11% performance improvement with one method on a matrix-matrix multiplication benchmark.