Reliable, Efficient Recovery for Complex Services with Replicated Subsystems

2020 
Applications with internal substructure are common in the cloud, where many systems are organized as independently logged and replicated subsystems that interact via flows of objects or some form of RPC. Restarting such an application is difficult: a restart algorithm needs to efficiently provision the subsystems by mapping them to nodes with needed data and compute resources, while simultaneously guaranteeing that replicas are in distinct failure domains. Additional failures can occur during recovery, hence the restart process must itself be a restartable procedure. In this paper we present an algorithm for efficiently restarting a service composed of sharded subsystems, each using a replicated state machine model, into a state that (1) has the same fault-tolerance guarantees as the running system, (2) satisfies resource constraints and has all needed data to restart into a consistent state, (3) makes safe decisions about which updates to preserve from the logged state, (4) ensures that the restarted state will be mutually consistent across all subsystems and shards, and (5) ensures that no committed updates will be lost. If restart is not currently possible, the algorithm will await additional resources, then retry.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    26
    References
    1
    Citations
    NaN
    KQI
    []