Improving Failure Tolerance in Large-Scale Cloud Computing Systems

2019 
Large-scale cloud computing systems have served as the fundamental supporting platform for big data, Internet of Things, and artificial intelligence applications for the past decade. As the scale and complexity of these systems increase dramatically, various hardware and software failures will inevitably occur and may not be detected and repaired in a timely manner. In addition, sophisticated architectural features of cloud computing may have an adverse impact on system reliability. In response to these challenges, this paper proposes a simulation-driven framework, built on operation logs from real cloud computing systems, for improving failure tolerance in large-scale cloud computing systems. For a given cloud computing system, we first conduct a systematic analysis of its structure and operation characteristics. A Markov-based model is then used to examine the system's potential failures, assess their severity, and suggest quick recoveries. During this process, the proposed reliability-aware resource scheduling algorithm is adopted to optimize resources so that the system's reliability can be improved cost-effectively. We also report a case study that demonstrates the application of our algorithm to improving the failure tolerance of a large-scale cloud computing system.
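The abstract does not specify the Markov-based model, so the following is only a minimal illustrative sketch of the general idea, not the paper's model. It assumes the simplest possible case: a single component that alternates between an "up" and a "down" state with constant failure and repair rates (a two-state continuous-time Markov chain). The function names and rate parameters are hypothetical and chosen for illustration.

```python
import math


def steady_state_availability(failure_rate: float, repair_rate: float) -> float:
    """Long-run fraction of time a two-state (up/down) Markov component is up.

    Assumption (not from the paper): failures and repairs occur at constant
    rates, so the component is a two-state continuous-time Markov chain.
    """
    if failure_rate < 0 or repair_rate <= 0:
        raise ValueError("failure_rate must be >= 0 and repair_rate must be > 0")
    return repair_rate / (failure_rate + repair_rate)


def transient_availability(failure_rate: float, repair_rate: float, t: float) -> float:
    """Probability the component is up at time t, given it was up at time 0."""
    if t < 0:
        raise ValueError("t must be >= 0")
    total = failure_rate + repair_rate
    steady = steady_state_availability(failure_rate, repair_rate)
    # Closed-form solution of the two-state chain's Kolmogorov equations:
    # the initial advantage of starting "up" decays exponentially at rate total.
    return steady + (failure_rate / total) * math.exp(-total * t)


if __name__ == "__main__":
    # Hypothetical component: mean time between failures 1000 h, mean repair time 10 h.
    lam, mu = 1 / 1000, 1 / 10
    print(f"steady-state availability: {steady_state_availability(lam, mu):.4f}")
    print(f"availability after 5 h:    {transient_availability(lam, mu, 5.0):.4f}")
```

In a sketch like this, a shorter repair time (a larger repair rate) raises availability, which is the kind of trade-off a reliability-aware scheduler would weigh against resource cost.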