A Fast, General Storage Replication Protocol for Active-Active Virtual Machine Fault Tolerance

2017 
Cloud computing enables more and more online services to be deployed in virtual machines (VMs), making fast VM fault tolerance particularly crucial. Unfortunately, despite much effort, achieving fast VM fault tolerance remains an open problem. The traditional way to provide VM fault tolerance is the active-passive approach, which frequently transfers large amounts of updated state, including memory and storage, from a primary VM to a suspended secondary VM. The other, emerging approach, namely active-active, runs the secondary VM concurrently with the primary. Compared to active-passive, active-active is faster because it performs the transfer only when the externally visible states (e.g., network outputs) of the primary and secondary diverge. However, active-active aggravates the performance problem on I/O intensive workloads. In existing active-active systems, storage replication protocols hold the updated storage states of both the primary and the secondary on the secondary's storage, incurring excessive I/O contention. For instance, both our evaluation and a prior study show that a well-engineered active-active system, COLO, degrades the throughput of I/O intensive services by up to 61.6%. To tackle this open problem, this paper presents GANNET, a fast and general storage replication protocol for active-active VM fault tolerance systems. It greatly alleviates the I/O contention on the secondary's storage by efficiently buffering the updated disk states from both the primary and secondary VM in memory. GANNET includes a lightweight storage checkpoint algorithm to avoid consuming too much memory. GANNET is proved to be as reliable as existing storage replication protocols. We integrated GANNET into two popular active-active systems. Evaluation on six widely used services shows that GANNET incurred 15.9% overhead compared to native execution and outperformed COLO's storage replication protocol by 1.2X to 2.6X. GANNET's source code is available at github.com/hku-systems/gannet.
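The buffering idea in the abstract can be illustrated with a toy model. The sketch below is not GANNET's algorithm; it only assumes that the secondary keeps writes from the primary and from the secondary VM in separate in-memory buffers, that a checkpoint adopts the primary's state, and that a failover adopts the secondary's. The class `BufferedReplica`, its methods, and the `max_buffered_sectors` threshold are all hypothetical names introduced for illustration.

```python
class BufferedReplica:
    """Toy model of a secondary's disk with in-memory write buffers.

    Hypothetical illustration only; not taken from the GANNET source.
    """

    def __init__(self, disk, max_buffered_sectors=4):
        self.disk = dict(disk)      # sector number -> bytes, the on-disk state
        self.primary_buf = {}       # writes forwarded from the primary VM
        self.secondary_buf = {}     # writes issued by the secondary VM
        self.max_buffered_sectors = max_buffered_sectors  # assumed memory cap

    def write_from_primary(self, sector, data):
        # Buffer in memory instead of writing to the secondary's disk.
        self.primary_buf[sector] = data
        self._maybe_checkpoint()

    def write_from_secondary(self, sector, data):
        self.secondary_buf[sector] = data
        self._maybe_checkpoint()

    def read_for_secondary(self, sector):
        # The secondary VM must see its own latest writes first.
        if sector in self.secondary_buf:
            return self.secondary_buf[sector]
        return self.disk.get(sector)

    def checkpoint(self):
        # Assumed semantics: the secondary is resynchronized to the primary,
        # so primary writes reach disk and secondary writes are discarded.
        self.disk.update(self.primary_buf)
        self.primary_buf.clear()
        self.secondary_buf.clear()

    def failover(self):
        # Assumed semantics: the primary is gone, so the secondary's own
        # writes become the durable state.
        self.disk.update(self.secondary_buf)
        self.primary_buf.clear()
        self.secondary_buf.clear()

    def _maybe_checkpoint(self):
        # Bound memory use by checkpointing once the buffers grow too large.
        buffered = len(self.primary_buf) + len(self.secondary_buf)
        if buffered > self.max_buffered_sectors:
            self.checkpoint()
```

In this model neither VM's writes touch the secondary's disk until a checkpoint or failover, which is the property the abstract credits with reducing I/O contention. A real protocol would also have to coordinate with memory checkpoints and handle sector-level consistency, which this sketch ignores.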