Do moldable applications perform better on failure-prone HPC platforms?

Valentin Le Fèvre, George Bosilca, Aurélien Bouteiller, Thomas Hérault, Atsushi Hori, Yves Robert, Jack Dongarra

2018

This paper compares the performance of different approaches to tolerating failures using checkpoint/restart on large-scale failure-prone platforms. We study (i) \(\textsc {Rigid}\) applications, which use a constant number of processors throughout execution; (ii) \(\textsc {Moldable}\) applications, which can use a different number of processors after each restart following a fail-stop error; and (iii) \(\textsc {GridShaped}\) applications, which are moldable applications restricted to rectangular processor grids (as is the case for many dense linear algebra kernels). For each application type, we compute the optimal number of failures to tolerate before relinquishing the current allocation and waiting until a new resource can be allocated, and we determine the optimal yield that can be achieved. We instantiate our performance model with a realistic application scenario and make it publicly available for further use.
