Resilient computational applications using Coarray Fortran

2019 
Abstract With the growing number of hardware components and software-stack layers in High Performance Computing (HPC), the number of user-visible hardware and software failures is likely to increase. Even under the most optimistic assumptions about the reliability of individual components, probabilistic amplification from using millions of nodes has a dramatic impact on the Mean Time Between Failures (MTBF) of the entire platform. Although several techniques to address this problem have been developed, the support that programming models give users to mitigate or work around it is still insufficient. The Fortran 2018 standard defines failed images, a new feature that allows the programmer to detect and manage image failures in a parallel program. In this paper we show how to use failed images and teams, another feature defined in the Fortran 2018 standard, to implement resilient computational applications.
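As a rough illustration of the amplification effect mentioned in the abstract, consider a simplified model that is an assumption made here rather than one taken from the paper: N identical nodes that fail independently, each with exponentially distributed time to failure. The failure rates then add, so the platform MTBF shrinks in proportion to the node count. The figures used (a 10-year node MTBF and one million nodes) are hypothetical and chosen only to make the arithmetic concrete.

```latex
% Assumed model: N independent nodes, each with failure rate \lambda = 1 / MTBF_node.
\[
  \lambda_{\text{system}} = N \lambda
  \quad\Longrightarrow\quad
  \mathrm{MTBF}_{\text{system}} = \frac{1}{N \lambda} = \frac{\mathrm{MTBF}_{\text{node}}}{N}
\]
% Hypothetical numbers: MTBF_node = 10 years \approx 3.16 \times 10^{8} s, N = 10^{6}.
\[
  \mathrm{MTBF}_{\text{system}} \approx \frac{3.16 \times 10^{8}\ \text{s}}{10^{6}}
  \approx 316\ \text{s} \approx 5\ \text{minutes}
\]
```

Under these assumptions, a node that fails on average once per decade still yields a platform that sees a failure roughly every five minutes, which is why detection and recovery support in the programming model matters.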