On the performance and reliability of fault-tolerant scalable coherent interface networks

1999 
Mission-critical, critical-computation, and other demanding applications can benefit from low-latency, high-bandwidth interconnects. These applications also demand fault tolerance to maintain connectivity among all data-processing nodes. Among the emerging high-speed interconnect technologies, the Scalable Coherent Interface (SCI) has come to the forefront, providing high data throughput while supporting a simple, easily adaptable design. Its speed has been exploited in several multiprocessor systems, and SCI is becoming increasingly popular in the cluster-computing community. SCI can also be applied to non-computing applications such as aircraft avionics. Unfortunately, such applications require a high degree of fault tolerance that SCI does not currently provide: it lacks the fault-tolerance protocols needed to implement reconfigurable interconnects. This dissertation addresses the definition and implementation of these protocols, with the ultimate goal of creating reconfigurable, resilient SCI interconnects. This goal is reached in three major steps. The first is the development of a high-fidelity SCI model that can be used to construct and simulate any standard ring-based topology with distributed switching. The results presented are the first high-fidelity simulations of SCI multiprocessor networks with k-ary n-cube topologies. The studies concentrate on the performance scalability of SCI in terms of throughput and latency versus topology size and dimension. The second step is the extension of the existing SCI fault-tolerance protocols defined by the standard. Implementing these protocols requires hardware additions in the form of fault handlers that detect errors and create, receive, and interpret diagnostic and control packets. The third and final step is the development of techniques to determine the reliability of the fault-tolerant SCI topologies.
Previous k-ary n-cube reliability models were based on the assumption that link failures were independent of one another. This assumption is not valid with ring-based interconnects where a single link failure results in the failure of an entire ringlet within the topology. The reliability model presented is based on the failure probabilities of the ringlets comprising the topology, rather than the failure probabilities of each individual link.
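The ringlet-level view can be illustrated with a minimal sketch. This is not the dissertation's model. It assumes, purely for illustration, that each ringlet is a unidirectional ring of k links, that physical links fail independently with the same reliability, and that a ringlet is up only when all of its links are up. The function names are hypothetical.

```python
def ringlet_count(k: int, n: int) -> int:
    """Ringlets in a k-ary n-cube: k**(n-1) rings in each of n dimensions."""
    return n * k ** (n - 1)


def ringlet_reliability(link_reliability: float, k: int) -> float:
    """Probability that a ringlet of k links is up (assumed: all k must work)."""
    return link_reliability ** k


def both_links_unusable(link_reliability: float, k: int) -> float:
    """Probability that two links on the SAME ringlet are both unusable.

    A single failure takes down the whole ringlet, so both links become
    unusable together: the probability equals that of the ringlet failing.
    """
    return 1.0 - ringlet_reliability(link_reliability, k)


def both_links_unusable_if_independent(link_reliability: float, k: int) -> float:
    """What an independence assumption would predict for the same event."""
    return (1.0 - ringlet_reliability(link_reliability, k)) ** 2


if __name__ == "__main__":
    p, k, n = 0.99, 4, 2  # illustrative values only, not taken from the source
    print("ringlets:", ringlet_count(k, n))
    print("ringlet reliability:", ringlet_reliability(p, k))
    print("same-ringlet joint outage:", both_links_unusable(p, k))
    print("independence prediction:", both_links_unusable_if_independent(p, k))
```

Under these assumptions, the joint outage probability of two links on one ringlet is far larger than the product that independence would predict, which is why the ringlet, rather than the individual link, is the natural unit of failure.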