Understanding Exception-Related Bugs in Large-Scale Cloud Systems

2019 
Exception mechanism is widely used in cloud systems. This is mainly because it separates the error handling code from main business logic. However, the huge space of potential error conditions and the sophisticated logic of cloud systems present a big hurdle to the correct use of exception mechanism. As a result, mistakes in the exception use may lead to severe consequences, such as system downtime and data loss. To address this issue, the communities direly need a better understanding of the exception-related bugs, i.e., eBugs, which are caused by the incorrect use of exception mechanism, in cloud systems. In this paper, we present a comprehensive study on 210 eBugs from six widely-deployed cloud systems, including Cassandra, HBase, HDFS, Hadoop MapReduce, YARN, and ZooKeeper. For all the studied eBugs, we analyze their triggering conditions, root causes, bug impacts, and their relations. To the best of our knowledge, this is the first study on eBugs in cloud systems, and the first one that focuses on triggering conditions. We find that eBugs are severe in cloud systems: 74% of our studied eBugs affect system availability or integrity. Luckily, exposing eBugs through testing is possible: 54% of the eBugs are triggered by non-semantic conditions, such as network errors; 40% of the eBugs can be triggered by simulating the triggering conditions at simple system states. Furthermore, we find that the triggering conditions are useful for detecting eBugs. Based on such relevant findings, we build a static analysis tool, called DIET, and apply it to the latest versions of the studied systems. Our results show that DIET reports 31 bugs and bad practices, and 23 of them are confirmed by the developers as "previously-unknown" ones.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    37
    References
    9
    Citations
    NaN
    KQI
    []