Bad Nodes Considered Harmful: How to Find and Fix the Problem

2020 
Large, distributed systems of computing units are the current state of the art for conducting high-performance computing. With large systems comes an increasing chance that some component in the system will fail, necessitating research into how to cope with failure. Failures may manifest as compute nodes shutting down, but also as differing performance among compute nodes. This chapter investigates a recent occurrence of the latter and how to avoid it in large-scale runs.