Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization

2017 
Massively heterogeneous architectures are widely adopted in the design of modern peta-scale and future exa-scale systems. Because such heterogeneous clusters involve an increasing number of components, fault tolerance is essential to the reliability of the whole system. However, existing programming models for heterogeneous clusters (e.g., MPI+X) focus more on performance than on reliability. In this paper, we design and implement a fault tolerance framework, based on in-memory checkpointing, for hybrid programs that leverage heterogeneous hardware architectures. We provide new capabilities for programming heterogeneous applications that greatly simplify the implementation of application-level checkpointing. We also optimize checkpoint saving and loading to increase scalability. We validate the effectiveness of the framework with various benchmarks and real-world applications on the Tianhe-2 supercomputer. Our experimental results show that the framework improves the resilience of long-running applications and reduces checkpointing overhead.