A lightweight approach to GPU resilience

Max Baird, Christian Fensch, Sven-Bodo Scholz, Artjoms Sinkarovs

2018

Resilience for HPC applications is typically implemented as a CPU-based rollback-recovery technique. In this context, long-running accelerator computations on GPUs pose a major challenge, as these devices usually offer no means of interrupting a running kernel. This paper proposes a solution to this problem: a novel approach that rewrites GPU kernels so that a soft interrupt of their execution becomes possible. Our approach is based on Nvidia’s Compute Unified Device Architecture (CUDA) and takes advantage of CUDA’s execution model, which partitions threads into blocks. In essence, we rewrite the kernel so that each block determines whether it should continue execution or return control to the CPU. By doing so, we are able to interrupt kernels prematurely.
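
The block-level check described above can be illustrated with a minimal CUDA sketch. It is not the paper's actual kernel rewrite; it only shows one plausible way a block could decide between continuing and returning. All names (`scale_interruptible`, `stop_flag`, `block_done`, `run_demo`) are hypothetical, and the use of a host-mapped flag plus a per-block completion array is an assumption made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: doubles every element of `data`.
// Assumption: the host requests a soft interrupt through `stop_flag`
// (mapped pinned memory), and `block_done` records finished blocks.
__global__ void scale_interruptible(float *data, int n,
                                    const volatile int *stop_flag,
                                    int *block_done)
{
    __shared__ int skip;
    if (threadIdx.x == 0) {
        // One thread decides for the whole block: skip if this block
        // already finished in an earlier launch, or if a stop is requested.
        skip = block_done[blockIdx.x] || *stop_flag;
    }
    __syncthreads();
    if (skip) return;  // uniform per block, so every thread returns together

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;

    __syncthreads();   // the whole block has finished its work
    if (threadIdx.x == 0) block_done[blockIdx.x] = 1;
}

// Counts the blocks marked as finished.
static int count_done(const int *block_done, int blocks)
{
    int done = 0;
    for (int b = 0; b < blocks; ++b) done += block_done[b];
    return done;
}

// Runs an "interrupted" launch followed by resumed launches.
// Error checks on CUDA calls are omitted for brevity.
void run_demo(int *done_after_interrupt, int *done_after_resume,
              float *first_value)
{
    const int n = 1024, threads = 128;
    const int blocks = (n + threads - 1) / threads;

    float *data = nullptr;
    int *block_done = nullptr, *stop_host = nullptr, *stop_dev = nullptr;
    cudaMallocManaged((void **)&data, n * sizeof(float));
    cudaMallocManaged((void **)&block_done, blocks * sizeof(int));
    cudaHostAlloc((void **)&stop_host, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&stop_dev, stop_host, 0);

    for (int i = 0; i < n; ++i) data[i] = 1.0f;
    for (int b = 0; b < blocks; ++b) block_done[b] = 0;

    // Stop requested: every block returns control without doing work.
    *stop_host = 1;
    scale_interruptible<<<blocks, threads>>>(data, n, stop_dev, block_done);
    cudaDeviceSynchronize();
    *done_after_interrupt = count_done(block_done, blocks);

    // Stop cleared: relaunching resumes the unfinished blocks. The second
    // relaunch shows that finished blocks are skipped, not recomputed.
    *stop_host = 0;
    scale_interruptible<<<blocks, threads>>>(data, n, stop_dev, block_done);
    cudaDeviceSynchronize();
    scale_interruptible<<<blocks, threads>>>(data, n, stop_dev, block_done);
    cudaDeviceSynchronize();
    *done_after_resume = count_done(block_done, blocks);
    *first_value = data[0];

    cudaFree(data);
    cudaFree(block_done);
    cudaFreeHost(stop_host);
}

int main()
{
    int interrupted = 0, resumed = 0;
    float first = 0.0f;
    run_demo(&interrupted, &resumed, &first);
    printf("blocks done after interrupt: %d\n", interrupted);
    printf("blocks done after resume:    %d\n", resumed);
    printf("data[0] after resume:        %.1f\n", first);
    return 0;
}
```

In this sketch the flag is set before the launch to keep the demonstration deterministic. In a real resilience setting the host would presumably set it while the kernel is running, so that blocks not yet started return immediately and only the remaining blocks need to be relaunched after recovery.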

Keywords:

Computation
CUDA
Execution model
Kernel (linear algebra)
Parallel computing
Thread (computing)
Interrupt
Architecture
Central processing unit
Computer science
