Low Overhead Soft Error Mitigation Methodologies

2018 
CMOS technology scaling is bringing new challenges to the designers in the form of new failure modes. The challenges include long term reliability failures and particle strike induced random failures. Studies have shown that increasingly, the largest contributor to the device reliability failures will be soft errors. Due to reliability concerns, the adoption of soft error mitigation techniques is on the increase. As the soft error mitigation techniques are increasingly adopted, the area and performance overhead incurred in their implementation also becomes pertinent. This thesis addresses the problem of providing low cost soft error mitigation. The main contributions of this thesis include, (i) proposal of a new delayed capture methodology for low overhead soft error detection, (ii) adopting Error Control Coding (ECC) for delayed capture methodology for correction of single event upsets, (iii) analyzing the impact of different derating factors to reduce the hardware overhead incurred by the above implementations, and (iv) proposal for hardware software co-design for reliability based upon critical component identification determined by the application executing on the hardware (as against standalone hardware analysis). This thesis first surveys existing soft error mitigation techniques and their associated limitations. It proposes a new delayed capture methodology as a low overhead soft error detection technique. Delayed capture methodology is an enhancement of the Razor flip-flop methodology. In the delayed capture methodology, the parity for a set of flip-flops is calculated at their inputs and outputs. The input parity is latched on a second clock, which is delayed with respect to the functional clock by more than the soft error pulse width. It requires an extra flip-flop for each set of flip-flops. On the other hand, in the Razor flip-flop methodology an additional flip-flop is required for every functional flip-flop. Due to the skew in the clocks, either the parity flip-flop or the functional flip-flop will capture the effect of transient, and hence by comparing the output parity and latched input parity an error can be detected. Fault injection experiments are performed to evaluate the bneefits and limitations of the proposed approach. The limitations include soft error detection escapes and lack of error correction capability. Different cases of soft error detection escapes are analyzed. They are attributed mainly to a Single Event Upset (SEU) causing multiple flip-flops within a group to be in error. The error space due to SEUs is analyzed and an intelligent flip-flop grouping method using graph theoretic formulations is proposed such that no SEU can cause multiple flip-flops within a group to be in error. Once the error occurs, leaving the correction aspects to the application may not be desirable. The proposed delayed capture methodology is extended to replace parity codes with codes having higher redundancy to enable correction. The hardware overhead due to the proposed methodology is analyzed and an…
    • Correction
    • Cite
    • Save
    • Machine Reading By IdeaReader
    59
    References
    0
    Citations
    NaN
    KQI
    []