Analyzing DUE Errors on GPUs With Neutron Irradiation Test and Fault Injection to Control Flow

2021 
As GPU applications expand, the reliability of GPU is drawing more attention since even reliability-demanding applications are executed on GPUs. Silent data corruption (SDC) is widely studied both in irradiation experiments and fault injection experiments. On the other hand, detectable uncorrected error (DUE) is not well studied. This work focuses on DUEs reported by the GPU driver and analyzes those observed in fault injection and neutron irradiation experiments, where faults are injected in the control flow to change the program counter value unexpectedly. The DUE errors of GPU engine exception, GPU memory page fault, and GPU processing stop are observed in both the experiments. On the other hand, the DUE error categorized as internal microcontroller halt by the GPU driver, which is not found in the fault injection experiment, is observed frequently, suggesting the necessity of investigating the failures originating from the faults in the components invisible to programmers.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    19
    References
    0
    Citations
    NaN
    KQI
    []