Fault Tolerant Optimizations for High Performance Computing Systems
2020
In this dissertation, we present a comprehensive survey of state-of-the-practice failure prediction methods for HPC systems. We further introduce data migration as a promising way of achieving proactive fault tolerance in HPC systems. We present a lightweight application library, called LAIK, that assists application programmers in making their applications fault tolerant. Moreover, we propose an extension, called MPI sessions and MPI process sets, to the Message Passing Interface (MPI), the state-of-the-art programming model for HPC applications, in order to benefit from failure prediction.