Checkpointing in Distributed Computing Systems

2002 
In this chapter, we present a message optimal non-intrusive checkpointing protocol for nondeterministic message passing distributed computing systems that does not require global time. Checkpoints in distributed systems can be coordinated, independent or quasi-synchronous. Coordinated checkpointing is attractive due to simple recovery, domino-freeness and optimal stable storage requirement. The quasi-synchronous checkpointing approach is also domino-free but may force processes to take multiple checkpoints. Independent checkpointing requires multiple local checkpoints of each node to be stored on stable storage and is affected by “domino effect”. Coordinated checkpointing has been found better than independent checkpointing as it is domino-free and has minimum storage and performance overheads. So far, many coordinated checkpointing protocols have been proposed in literature for distributed computing systems. These protocols can be broadly classified as minimum process intrusive checkpointing protocols and non-intrusive checkpointing protocols. In this chapter, we present a non-intrusive coordinated checkpointing protocol for distributed systems with least failure-free overhead. The proposed checkpointing algorithm has optimal communication and storage overheads. It requires only O(n) extra messages for taking a global consistent checkpoint. We introduce the concept of “odd” and “even” checkpoint intervals that replace the checkpoint sequence numbers that are generally piggybacked with each message to avoid orphan messages.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    21
    References
    12
    Citations
    NaN
    KQI
    []