Mitigating I/O Impact of Checkpointing on Large Scale Parallel Systems

Nana Wang, Qingzheng Sun, Yi Liu, Depei Qian

2018

Checkpointing is the most widely used technique in high-performance computing systems for tolerating fail-stop errors and ensuring reliable execution of parallel applications. However, as high-performance computers scale up, the number of processors and compute nodes increases rapidly, which amplifies the I/O impact of checkpointing on the system. On reaching a checkpoint, all nodes generate checkpoint data and write it to the storage system simultaneously, producing a massive burst of traffic and data across the I/O infrastructure, including the interconnection network, the parallel file system, and storage. To mitigate this I/O impact, this paper proposes a self-adaptive random delay approach to control the writing of checkpoint data. Each node still generates its checkpoint data at the same time, but writes it out according to a self-adaptive random delay policy, which smooths the burst of traffic and data. Experimental and theoretical analysis results show that this approach can mitigate the I/O impact of checkpointing on large-scale parallel systems.
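The smoothing effect described in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's algorithm: the abstract does not specify the delay distribution or the adaptation rule, so the uniform delay, the doubling/halving rule in `adapt_window`, and all function and parameter names here are assumptions chosen only for illustration.

```python
import random


def peak_concurrent_writers(start_times, write_duration):
    """Return the maximum number of nodes writing at the same instant."""
    events = []
    for t in start_times:
        events.append((t, 1))                    # a node starts writing
        events.append((t + write_duration, -1))  # the same node finishes
    # At equal timestamps, process finishes (-1) before starts (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


def random_delays(num_nodes, window, rng):
    """Assumed policy: each node delays its write uniformly within [0, window]."""
    return [rng.uniform(0.0, window) for _ in range(num_nodes)]


def adapt_window(window, observed_peak, target_peak, min_window, max_window):
    """Hypothetical adaptation rule (not from the paper): widen the delay
    window when the observed peak exceeds the target, shrink it otherwise."""
    if observed_peak > target_peak:
        return min(window * 2.0, max_window)
    return max(window / 2.0, min_window)


if __name__ == "__main__":
    rng = random.Random(0)
    num_nodes, write_duration = 1000, 1.0

    # Baseline: every node writes as soon as the checkpoint is reached.
    burst_peak = peak_concurrent_writers([0.0] * num_nodes, write_duration)

    # Delayed: every node waits a random time before writing.
    delayed = random_delays(num_nodes, window=100.0, rng=rng)
    delayed_peak = peak_concurrent_writers(delayed, write_duration)

    print("peak writers without delay:", burst_peak)
    print("peak writers with random delay:", delayed_peak)
```

The trade-off in this sketch is that a wider delay window lowers the peak number of concurrent writers but lengthens the time until the whole checkpoint is on storage, which is why some form of adaptation of the window is useful.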
