Replicated processors on a single die – How independently do they fail?

2011 
A very popular and efficient method for achieving fault tolerance is replication of components paired with a comparison of their outputs. System-on-chip architectures enable a cost-efficient implementation of this scheme on a single die. The resulting close physical proximity of the replicas, however, implies increased coupling, and single-die solutions are therefore more susceptible to common-cause faults (CCFs) than equivalent multi-chip approaches. So far, it has been unclear to what degree this coupling reduces the dependability gain that replication achieves in a single-die solution. In this paper we analyze potential coupling mechanisms and study under which circumstances they lead to identical outputs in all replicas, since this is exactly the case in which the "replication and comparison" scheme fails. We perform both simulation studies and comprehensive experimental investigations to derive a quantitative answer to this question. Our particular focus is on thermal effects and on the effects of disturbances in a shared power supply in a duplicated processor architecture. Beyond observing the relative probability of occurrence of CCFs, we also study the effectiveness of several countermeasures against them. We elaborate a model that decomposes the genesis of CCFs into several steps, and show that a very tight spatial and temporal coincidence of the fault effect in both replicas is crucial for a CCF; such a coincidence is unlikely, for example, in the case of thermal effects. As a general result, it turns out that even small asymmetries between the cores yield a drastic reduction in the CCF probability.