Numerical fault tolerant strategies for resilient parallel eigensolvers

Emmanuel Agullo, Luc Giraud, Pablo Salas, Mawussi Zounon

2016

The solution of large eigenproblems is involved in many scientific and engineering applications when, for instance, stability analysis is a concern. For large simulations in material physics or thermo-acoustics, the calculation can last for many hours on large parallel platforms. However, on future large-scale systems, the mean time between failures (MTBF) of the system is expected to decrease, so that many faults could occur during the solution of large eigenproblems. Consequently, it becomes critical to design parallel eigensolvers that can survive faults. In that framework, we investigate the relevance of approaches relying on numerical techniques, which might be combined with more classical techniques for real large-scale parallel implementations. To focus on numerical remedies, we assume that a separate mechanism ensures fault detection and that a system layer provides support for setting the environment (processes, ...) back in a running state. Once the system is in a running state after a fault, our main objective is to provide robust resilient schemes so that the eigensolver may keep converging in the presence of the fault without restarting the calculation from scratch. For this purpose, we extend to eigenproblems the interpolation-restart (IR) strategies initially introduced in previous work [1, 2] for the solution of linear systems. Our strategy consists of extracting relevant spectral information from the data available after a fault. After data extraction, a well-selected part of the missing data is regenerated through interpolation strategies to constitute meaningful input to restart the numerical algorithm. The main feature of this numerical remedy is that, on the one hand, it does not require extra resources, i.e., computational units or computing time, when no fault occurs. On the other hand, for a local fault, i.e., a fault on a single node in a distributed environment, the recovery is performed locally.
We revisit a few state-of-the-art methods for solving large sparse eigenvalue problems, namely the Arnoldi methods, subspace iteration methods and the Jacobi-Davidson method, in the light of our fault tolerant strategies. For each considered eigensolver, we adapt the strategies to regenerate as much spectral information as possible. Through extensive numerical experiments reported in [3], we study the respective robustness of the resulting resilient schemes with respect to the MTBF and to the amount of data loss via qualitative and quantitative illustrations.
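The abstract does not give formulas, so the following is only a minimal sketch of what regenerating lost data by interpolation could look like for a single eigenpair. It assumes, as an illustration rather than as the authors' exact scheme, that the entries of an approximate eigenvector held by the failed node (index set `lost`) are rebuilt from the surviving entries by solving the local block of (A - lam I) u = 0. The function name `interpolate_lost_entries`, the test matrix, and the choice of lost indices are all hypothetical.

```python
import numpy as np

def interpolate_lost_entries(A, lam, u, lost):
    """Rebuild u[lost] from the surviving entries of an approximate eigenpair.

    Assumed interpolation rule (illustrative only): the rows of
    (A - lam*I) u = 0 that belong to the failed node give the local system
        (A_II - lam*I) u_I = -A_IJ u_J,
    where I is the lost index set and J the surviving one.
    """
    n = A.shape[0]
    lost = np.asarray(lost)
    kept = np.setdiff1d(np.arange(n), lost)
    A_II = A[np.ix_(lost, lost)]   # block owned by the failed node
    A_IJ = A[np.ix_(lost, kept)]   # coupling to surviving entries
    rhs = -A_IJ @ u[kept]
    u_new = u.copy()
    u_new[lost] = np.linalg.solve(A_II - lam * np.eye(len(lost)), rhs)
    return u_new

# Small symmetric test problem: 1D Laplacian (hypothetical example data).
n = 8
A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
vals, vecs = np.linalg.eigh(A)
lam, u = vals[-1], vecs[:, -1]     # largest eigenpair

# Simulate a fault: entries 2..4 of the eigenvector are lost.
lost = [2, 3, 4]
u_faulty = u.copy()
u_faulty[lost] = 0.0

u_rec = interpolate_lost_entries(A, lam, u_faulty, lost)
recovery_error = np.linalg.norm(u_rec - u)
print(f"recovery error: {recovery_error:.2e}")
```

Because the eigenpair in this toy example is already converged, the local solve recovers the lost entries up to rounding error. With a partially converged eigenpair the regenerated vector would only be an approximation, which is why it serves as input for restarting the eigensolver rather than as a final result. The local block must also be nonsingular for the solve to succeed, which holds here but is an assumption in general.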
