CoFI: Consistency-Guided Fault Injection for Cloud Systems

2020 
Network partitions are inevitable in large-scale cloud systems. Despite developers' efforts to handle network partitions while designing, implementing, and testing cloud systems, bugs caused by network partitions, i.e., partition bugs, still exist and cause severe failures in production clusters. Exposing these partition bugs is challenging because they often require network partitions to start and stop at specific times. In this paper, we propose Consistency-Guided Fault Injection (CoFI), a novel technique that systematically injects network partitions to effectively expose partition bugs. We observe that network partitions can leave cloud systems in inconsistent states, where partition bugs are more likely to occur. Based on this observation, CoFI first infers invariants (i.e., consistent states) among different nodes in a cloud system. Upon detecting violations of the inferred invariants (i.e., inconsistent states) while the cloud system is running, CoFI injects network partitions to prevent the cloud system from recovering to a consistent state, and thoroughly tests whether the cloud system still proceeds correctly in inconsistent states. We have applied CoFI to three widely deployed cloud systems, i.e., Cassandra, HDFS, and YARN. CoFI has detected 12 previously unknown bugs, four of which have been confirmed by developers.
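The workflow described in the abstract (infer an invariant, watch for a violation, then partition the nodes involved so the inconsistency persists) can be illustrated with a minimal sketch. This is not CoFI's implementation: `ToyCluster`, `violated_invariants`, and `consistency_guided_injection` are hypothetical names, the "all replicas hold the same value for a key" invariant is assumed rather than inferred, and the cluster is an in-memory toy rather than Cassandra, HDFS, or YARN.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    state: dict = field(default_factory=dict)


class ToyCluster:
    """Hypothetical in-memory store whose writes replicate asynchronously."""

    def __init__(self, names):
        self.nodes = {n: Node(n) for n in names}
        self.pending = []         # queued messages: (src, dst, key, value)
        self.partitioned = set()  # frozenset({a, b}) pairs that cannot talk

    def write(self, coordinator, key, value):
        # Apply locally, then queue replication to every other node.
        self.nodes[coordinator].state[key] = value
        for name in self.nodes:
            if name != coordinator:
                self.pending.append((coordinator, name, key, value))

    def partition(self, a, b):
        self.partitioned.add(frozenset((a, b)))

    def heal(self, a, b):
        self.partitioned.discard(frozenset((a, b)))

    def deliver_all(self):
        """Deliver queued messages; those crossing a partition are dropped."""
        delivered = 0
        for src, dst, key, value in self.pending:
            if frozenset((src, dst)) not in self.partitioned:
                self.nodes[dst].state[key] = value
                delivered += 1
        self.pending.clear()
        return delivered


def violated_invariants(cluster, key):
    """Assumed invariant: every node holds the same value for `key`.

    Returns the node pairs that currently disagree.
    """
    names = sorted(cluster.nodes)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cluster.nodes[a].state.get(key) != cluster.nodes[b].state.get(key)
    ]


def consistency_guided_injection(cluster, key):
    """Partition each disagreeing pair so the inconsistent state persists."""
    injected = violated_invariants(cluster, key)
    for a, b in injected:
        cluster.partition(a, b)
    return injected


if __name__ == "__main__":
    cluster = ToyCluster(["n1", "n2"])
    cluster.write("n1", "k", "v1")  # n2 has not yet seen the write
    print("partitions injected:", consistency_guided_injection(cluster, "k"))
    cluster.deliver_all()           # replication message is dropped
    print("n2 sees:", cluster.nodes["n2"].state.get("k"))
```

In this toy, the partition is injected in the window between the local write and its replication, so `n2` never receives the update. A real test harness would then exercise the system's operations against the stale replica and check whether it still behaves correctly.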