Constructing Resiliant Communication Infrastructure for Runtime Environments

2009 
Next generation HPC platforms are expected to feature millions of cores distributed over hundreds of thousands of nodes, leading to scalability and fault-tolerance issues for both applications and runtime environments dedicated to run on such machines. Most parallel applications are developed using a communication API such as MPI, implemented in a library that runs on top of a dedicated runtime environment. Strong efforts have been made in the past decades to improve the performance, scalability and fault-tolerance at the library level. The most recent techniques propose to deal with failures locally, to avoid stopping and restarting the whole system. As a consequence, fault-tolerance becomes a critical property of the runtime environment. A runtime environment is a service of a parallel system to start and monitor applications. It is deployed on the parallel system by a launching service, usually following a spanning tree to improve scalability of the deployment. The first task of the runtime environment is then to build its own communication infrastructure to synchronize the tasks of the parallel application. A fault-tolerant runtime environment must detect failures, and coordinate with the application to recover from them. Communication infrastructures that are used today (e.g. trees and rings) are usually built in a centralized way and fail at providing support for fault-tolerance because a few failures lead with a high probability to disconnected components. Previous works [2] have demonstrated that the Binomial Graph topology (BMG) is a good candidate as a communication infrastructure for supporting both scalability and fault-tolerance for runtime environments. Roughly speaking, in a BMG, each process is the root of a binomial tree gathering all the processes. In this paper, we present and analyze a self-stabilizing algorithm1 to transform the underlying communication infrastructure provided by the launching service into a BMG, and maintain it in spite of failures. We demonstrate that this algorithm is scalable, tolerate transient failures, and adapt itself to topology changes.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    14
    References
    4
    Citations
    NaN
    KQI
    []