Failure recovery for bulk synchronous applications with MPI stages

2019 
Abstract When an MPI program experiences a failure, the most common recovery approach is to restart all processes from a previous checkpoint and to re-queue the entire job. A disadvantage of this method is that, although the failure occurred within the main application loop, live processes must start again from the beginning of the program, along with new replacement processes; this incurs unnecessary overhead for live processes. To avoid such overheads and the concomitant delays, we introduce the concept of "MPI Stages." MPI Stages saves internal MPI state in a separate checkpoint in conjunction with application state. Upon failure, both MPI and application state are recovered from their last synchronous checkpoints, and execution continues without restarting the overall MPI job. Live processes roll back only a few iterations within the main loop instead of rolling back to the beginning of the program, while a replacement for the failed process restarts and reintegrates, thereby achieving faster failure recovery. This approach integrates well with large-scale, bulk synchronous applications and checkpoint/restart. In this article, we identify requirements for production MPI implementations to support state checkpointing with MPI Stages, which include capturing and managing internal MPI state and serializing and deserializing user handles to MPI objects. We evaluate our fault tolerance approach with a proof-of-concept prototype MPI implementation that includes MPI Stages. We demonstrate its functionality and performance using LULESH, CoMD, and microbenchmarks. Our results show that MPI Stages reduces the recovery time for both LULESH and CoMD in comparison to checkpoint/restart and Reinit (a global-restart model).
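
The sketch below illustrates, in plain C with standard MPI, the bulk synchronous checkpoint pattern the abstract describes: application and MPI-state checkpoints taken at the same point in the main loop, and a restart that resumes a few iterations back rather than from the beginning of the program. The stages_* helper names and signatures are assumptions added for illustration only; they stand in for the prototype's internal-state serialization interface and are not standard MPI or the paper's exact API.

#include <mpi.h>
#include <stdio.h>

/* Placeholders standing in for the prototype's MPI-state checkpoint calls;
 * names and signatures are assumed for illustration, not standard MPI. */
static int stages_mpi_state_write(int iter) { (void)iter; return MPI_SUCCESS; }
static int stages_mpi_state_read(int *resume_iter) { *resume_iter = 0; return MPI_SUCCESS; }

/* Stand-in for the application's per-iteration work. */
static double compute_step(int rank, int iter) { return (double)(rank + 1) * iter; }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, resume_iter = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* On a restart after failure, recover internal MPI state alongside the
     * application checkpoint, so live processes re-enter the loop only a few
     * iterations back instead of rerunning the whole program. */
    stages_mpi_state_read(&resume_iter);

    for (int iter = resume_iter; iter < 100; ++iter) {
        double local = compute_step(rank, iter);
        double global = 0.0;

        /* Bulk synchronous exchange typical of LULESH/CoMD-style codes. */
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (iter % 10 == 0) {
            /* The application checkpoint would be written here, followed by
             * the matching MPI-state checkpoint taken at the same point. */
            stages_mpi_state_write(iter);
            if (rank == 0)
                printf("checkpoint at iteration %d, sum = %f\n", iter, global);
        }
    }

    MPI_Finalize();
    return 0;
}

A replacement process launched after a failure would take the same read-then-resume path as the live processes, which is what lets it reintegrate without forcing a full job restart.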