Fault Tolerant Optimizations for High Performance Computing Systems

2020 
In this dissertation, we present a comprehensive survey of state-of-the-practice failure prediction methods for HPC systems. We further introduce the concept of data migration as a promising way of achieving proactive fault tolerance in HPC systems. We present a lightweight application library, called LAIK, to assist application programmers in making their applications fault tolerant. Moreover, we propose MPI sessions and MPI process sets, an extension to the Message Passing Interface (MPI), the state-of-the-art programming model for HPC applications, in order to benefit from failure prediction.