Bad Nodes Considered Harmful: How to Find and Fix the Problem

Marco Seiz, Johannes Hötzer, Henrik Hierl, Stefan Andersson, Britta Nestler

2020

Large, distributed systems of computing units are the current state of the art for high-performance computing. As systems grow, so does the chance that some component fails, which necessitates research into how to cope with failure. Failures may manifest as compute nodes shutting down, but also as differing performance among compute nodes. This chapter investigates a recent occurrence of the latter and how to avoid it in large-scale runs.

Keywords:

Computer security
Considered harmful
Computer science
