Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems
2018
Checkpointing is the most widely used technique in high performance computing systems for tolerating fail-stop errors and ensuring reliable execution of parallel applications. However, as high performance computers scale up, the number of processors and computing nodes increases rapidly, and the I/O cost of checkpointing grows with it. On reaching a checkpoint, all nodes generate checkpoint data and write it to the storage system simultaneously, imposing a massive burst of traffic and data on the I/O infrastructure, including the interconnection network, the parallel file system, and storage. To mitigate this I/O impact, this paper proposes a self-adaptive random delay approach to control the writing of checkpoint data. Each node still generates its checkpoint data at the same time, but writes it out according to a self-adaptive random delay policy, which smooths the burst of traffic and data. Experimental results and theoretical analysis show that this approach can mitigate the I/O impact of checkpointing on large scale parallel systems.
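The abstract does not specify the delay policy, so the following is only a minimal sketch of the general idea of random write delays, not the paper's algorithm. It assumes each node draws a uniform random delay from a window sized to the time the storage system would need to absorb all checkpoint data. The function names, the uniform distribution, and the window formula are all hypothetical choices made for illustration.

```python
import random


def delay_window(num_nodes, bytes_per_node, aggregate_bandwidth):
    """Assumed window: seconds needed to drain every node's checkpoint
    at the full aggregate storage bandwidth (bytes per second)."""
    return num_nodes * bytes_per_node / aggregate_bandwidth


def pick_delay(window, rng):
    """Each node independently draws a delay in [0, window] (assumed uniform)."""
    return rng.uniform(0.0, window)


def peak_concurrent_writers(delays, write_time):
    """Largest number of writes in flight at any instant, given each
    node's start delay and a fixed per-node write duration."""
    events = []
    for d in delays:
        events.append((d, 1))                 # write starts
        events.append((d + write_time, -1))   # write ends
    # At equal timestamps, process ends (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


if __name__ == "__main__":
    rng = random.Random(0)
    nodes = 1000
    window = delay_window(nodes, bytes_per_node=1e9, aggregate_bandwidth=1e11)
    synchronized = [0.0] * nodes
    delayed = [pick_delay(window, rng) for _ in range(nodes)]
    print("peak writers, synchronized:", peak_concurrent_writers(synchronized, 0.1))
    print("peak writers, random delay:", peak_concurrent_writers(delayed, 0.1))
```

In this toy model, synchronized writes put every node on the storage system at once, while spreading the start times over the window keeps the number of simultaneous writers far lower. A self-adaptive variant would presumably recompute the window from observed conditions, but how the paper does so is not stated in the abstract.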