Fault Tolerance for OpenSHMEM

Pengfei Hao, Pavel Shamis, Manjunath Gorentla Venkata, Swaroop Pophale, Aaron Welch, Stephen W. Poole, Barbara M. Chapman

2014

On today's supercomputing systems, faults are becoming the norm rather than the exception. Given the complexity required to achieve the expected scalability and performance on future systems, this situation is expected to worsen, and those systems will have to function in the nearly constant presence of faults. To be productive on these systems, programming models will require both hardware and software to be resilient to faults. With the growing importance of the PGAS programming model and OpenSHMEM as part of the HPC software stack, the lack of a fault-tolerance model may become a liability for its users. To this end, we discuss the viability of using checkpoint/restart as a fault-tolerance method for OpenSHMEM, propose a selective checkpoint/restart fault-tolerance model, and discuss the challenges associated with implementing the proposed model.

Keywords:

Parallel computing
Programming paradigm
Software
Partitioned global address space
Fault tolerance
Scalability
Engineering
Distributed computing
Supercomputer
Hybrid programming
