A Probabilistic Fault-Tolerant Recovery Mechanism for Task and Result Certification of Large-Scale Distributed Applications

2009 
This paper addresses fault-tolerant recovery mechanisms and probabilistic result certification on large-scale architectures. Related work on result certification relies on total or partial duplication of the application, but it is limited to executions of independent tasks. In the present work, we extend these mechanisms to applications with dependent tasks. First, we propose an approach based on an abstract representation of a parallel execution, called a macro-dataflow graph. Second, we introduce probabilistic certification algorithms that avoid re-executing the program and allow recovery on different platforms with different numbers of processors. We also sketch how to simulate our framework using state-of-the-art workload modeling and fault injection tools.