Joint Progressive Network and Datacenter Recovery After Large-Scale Disasters

2020 
Large-scale disasters affecting both network and datacenter (DC) infrastructures can cause severe disruptions in cloud-based services. During post-disaster recovery, repairs are usually carried out progressively, in stages, because repair resources are limited. The order in which network elements and DCs are repaired can significantly affect users’ ability to reach important content and services. We investigate joint progressive network and DC recovery, in which network recovery and DC recovery are coordinated so that users have access to the maximum possible amount of content/services at each repair stage. We first solve the optimization problem of joint progressive recovery to find the optimal sequence of network element and DC repairs, with the objective of maximizing cumulative weighted content reachability in the network. We then propose a scalable heuristic for scheduling the sequential repair of network nodes/links and DCs. Our model assumes that, at each repair stage, one network node with its adjacent links and one DC can be fully repaired; however, full recovery may not be guaranteed due to limited resource availability. Hence, we also propose a “resource-aware” approach (with two resource-allocation strategies, namely “selective allocation” and “adaptive allocation”), which considers both full and partial recovery of elements based on the resources available at each stage. We show that, compared to a disjoint progressive recovery approach, in which the network recovery and DC recovery plans are independent, our joint progressive recovery approach provides significantly higher per-stage content reachability in the network.
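The staged repair model described above can be illustrated with a minimal greedy sketch. This is not the authors' optimization model or heuristic, which the abstract does not specify. It assumes a simplified reachability metric (for each connected group of working nodes, the group size times the total weight of content hosted at working DCs inside that group), assumes a link works whenever both of its endpoints work, and repairs one node and one DC per stage. All names (`reachable_weight`, `greedy_joint_recovery`, the toy topology) are hypothetical.

```python
from itertools import product


def reachable_weight(edges, up_nodes, up_dcs, dc_site, dc_contents, weights):
    """Assumed metric: for each connected component of working nodes, add
    (component size) * (total weight of content at working DCs in it)."""
    adj = {n: set() for n in up_nodes}
    for u, v in edges:
        # Assumption: a link is usable only if both endpoints are working.
        if u in up_nodes and v in up_nodes:
            adj[u].add(v)
            adj[v].add(u)
    total, seen = 0, set()
    for start in up_nodes:
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first search over working nodes
            for nb in adj[stack.pop()]:
                if nb not in comp:
                    comp.add(nb)
                    stack.append(nb)
        seen |= comp
        contents = set()
        for dc in up_dcs:
            if dc_site[dc] in comp:
                contents |= dc_contents[dc]
        total += len(comp) * sum(weights[c] for c in contents)
    return total


def greedy_joint_recovery(nodes, edges, dc_site, dc_contents, weights,
                          failed_nodes, failed_dcs):
    """Each stage, jointly pick the (node, DC) pair whose repair gives the
    highest reachability after that stage; return schedule and cumulative score."""
    up_nodes = set(nodes) - set(failed_nodes)
    up_dcs = set(dc_site) - set(failed_dcs)
    todo_nodes, todo_dcs = set(failed_nodes), set(failed_dcs)
    schedule, cumulative = [], 0
    while todo_nodes or todo_dcs:
        node_opts = sorted(todo_nodes) or [None]  # None: nothing left to repair
        dc_opts = sorted(todo_dcs) or [None]
        best = None
        for n, d in product(node_opts, dc_opts):
            trial_nodes = up_nodes | ({n} if n is not None else set())
            trial_dcs = up_dcs | ({d} if d is not None else set())
            score = reachable_weight(edges, trial_nodes, trial_dcs,
                                     dc_site, dc_contents, weights)
            if best is None or score > best[0]:
                best = (score, n, d)
        score, n, d = best
        if n is not None:
            up_nodes.add(n)
            todo_nodes.discard(n)
        if d is not None:
            up_dcs.add(d)
            todo_dcs.discard(d)
        schedule.append((n, d))
        cumulative += score
    return schedule, cumulative


# Toy instance (hypothetical): a four-node line A-B-C-D with a DC at each end.
schedule, total = greedy_joint_recovery(
    nodes=["A", "B", "C", "D"],
    edges=[("A", "B"), ("B", "C"), ("C", "D")],
    dc_site={"DC1": "A", "DC2": "D"},
    dc_contents={"DC1": {"x"}, "DC2": {"y"}},
    weights={"x": 3, "y": 1},
    failed_nodes=["B", "C"],
    failed_dcs=["DC1", "DC2"],
)
print(schedule, total)
```

Choosing the node and the DC together at each stage is what makes the sketch "joint": a disjoint plan would rank nodes and DCs separately and could, for example, repair a DC that no working node can yet reach. The greedy rule only looks one stage ahead, so it does not guarantee the optimal cumulative value.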