Light-weight black-box failure detection for distributed systems

2012 
Detecting failures in distributed systems is challenging, as modern datacenters run a variety of applications. Current techniques for detecting failures often require training, have limited scalability, or have results that are hard to interpret. We present LFD, a light-weight technique to quickly detect performance problems in distributed systems using only correlations of OS metrics. LFD is based on our hypothesis of server application behavior, does not require training, and detects failures with complexity linear in the number of nodes, with results that are interpretable by sysadmins. We further show that LFD is versatile, and can diagnose faults in Hadoop MapReduce systems and in multi-tier web request systems, and show how LFD is intuitive to sysadmins.