Concurrent Detection of Failures in GPU Control Logic for Reliable Parallel Computing

2020 
The reliability of GPUs is becoming a major concern because of the increased probability of failures and because GPUs are more vulnerable than conventional CPUs in terms of tasks per failure. While there are extensive countermeasures against failures in GPU data units, there are fewer countermeasures against failures in GPU control logic. Current software-based techniques insert signature code that detects GPU control-logic failures by comparing the expected signature value with the current signature value. However, these conventional techniques execute application calculations, signature calculations, and signature comparison calculations in sequence, which degrades application throughput. We have developed a software-based technique that concurrently detects GPU control-logic failures in a running application while largely maintaining its throughput. Experimental results show that when our technique concurrently executed application calculations, signature calculations, and signature comparison calculations for a matrix multiplication application, the application retained 78% of its original throughput, compared with the 62% reported in the literature. We also developed fault injection simulators specialized for injecting GPU-specific control-logic faults into GPU intermediate code and found that 100% of GPU-specific failures could be detected both during and after application execution. The proposed approach can be utilized for a wide variety of safety- and reliability-critical applications.
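The signature idea described above can be illustrated with a small CPU-side sketch. It is not the authors' implementation: the block identifiers, the `fold` hash, the `skip_at` fault-injection parameter, and the use of a Python thread as a stand-in for concurrent GPU execution are all assumptions made for illustration. Each output element of a matrix multiplication folds a constant into a running signature at every basic block it passes through, and a separate checker thread compares the finished signatures with the expected value while the multiplication continues.

```python
import queue
import threading

# Hypothetical per-block constants; any distinct values would do.
INIT, LOOP_BODY, STORE = 0x1, 0x2, 0x4


def fold(sig, block_id):
    """Fold one basic-block identifier into the running signature."""
    return (sig * 31 + block_id) & 0xFFFFFFFF


def expected_signature(n):
    """Signature a fault-free run produces for one element of an n x n product."""
    sig = fold(0, INIT)
    for _ in range(n):
        sig = fold(sig, LOOP_BODY)
    return fold(sig, STORE)


def matmul_with_signatures(a, b, out_queue, skip_at=None):
    """Multiply square matrices and emit one signature per output element."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sig = fold(0, INIT)
            acc = 0
            for k in range(n):
                if skip_at == (i, j, k):
                    continue  # injected control-flow fault: iteration skipped
                acc += a[i][k] * b[k][j]
                sig = fold(sig, LOOP_BODY)
            c[i][j] = acc
            sig = fold(sig, STORE)
            out_queue.put((i, j, sig))  # hand off; comparison happens elsewhere
    out_queue.put(None)  # end-of-stream marker
    return c


def checker(in_queue, expected, mismatches):
    """Compare signatures concurrently with the application calculation."""
    while True:
        item = in_queue.get()
        if item is None:
            break
        i, j, sig = item
        if sig != expected:
            mismatches.append((i, j))


def run(a, b, skip_at=None):
    """Run the multiplication with a concurrent signature checker."""
    q = queue.Queue()
    mismatches = []
    t = threading.Thread(
        target=checker, args=(q, expected_signature(len(a)), mismatches)
    )
    t.start()
    c = matmul_with_signatures(a, b, q, skip_at)
    t.join()
    return c, mismatches
```

Calling `run(a, b)` returns the product and an empty mismatch list for a fault-free run, while `run(a, b, skip_at=(0, 1, 1))` skips one loop iteration and the checker reports element `(0, 1)`. The point of the sketch is only the division of labor: the application thread never waits on a comparison, which is the property the sequential techniques lack.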