FaultSee: Reproducible Fault Injection in Distributed Systems

2020 
Distributed systems are increasingly important in modern society, often operating on a global scale with stringent dependability requirements. Despite the vast amount of research and the development of techniques to build dependable systems, faults are inevitable, as the regular failures of major IT service providers show. It is therefore fundamental to evaluate distributed systems under different fault patterns and adversarial conditions to assess their high-level behaviour and minimize the occurrence of failures. However, succinctly capturing the system configuration, environment, fault patterns and other variables affecting an experiment is very hard, leading to a reproducibility crisis. In this paper, we propose the FaultSee toolkit. The two components of FaultSee are (1) the simple and descriptive FDSL language, which captures the system, environment, workload and fault pattern characteristics; and (2) an easy-to-use platform to deploy and run the experiments described in that language. FaultSee allows experiments to be precisely described and reproduced, leading to a better assessment of the impact of faults in distributed systems. We showcase the key features of FaultSee by studying the impact of faults on real deployments of Apache Cassandra and BFT-Smart.