Proactive detection of software aging mechanisms in performance critical computers

2002 
Software aging is a phenomenon, usually caused by resource contention, that can cause mission-critical and business-critical computer systems to hang, panic, or suffer performance degradation. If the incipience or onset of software aging mechanisms can be reliably detected in advance of performance degradation, corrective actions can be taken to prevent system hangs, or dynamic failover events can be triggered in fault-tolerant systems. In the 1990s, the U.S. Department of Energy and NASA funded development of an advanced statistical pattern recognition method called the multivariate state estimation technique (MSET) for proactive online detection of dynamic sensor and signal anomalies in nuclear power plants and Space Shuttle Main Engine telemetry data. The present investigation was undertaken to assess the feasibility and practicability of applying MSET to real-time proactive detection of software aging mechanisms in complex, multi-CPU servers. The procedure uses MSET for model-based parameter estimation in conjunction with statistical fault detection and Bayesian fault decision processing. A real-time software telemetry harness was designed to continuously sample over 50 performance metrics related to computer system load, throughput, queue lengths, and transaction latencies. A series of fault injection experiments was conducted using a "memory leak" injector tool with controllable parasitic resource consumption rates. MSET was able to reliably detect the onset of resource contention problems with high sensitivity and excellent false-alarm avoidance. Spin-off applications of this NASA-funded innovation for business-critical eCommerce servers are described.
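The estimate-then-test structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. MSET itself is not specified here, so a plain baseline mean learned from healthy data stands in for the model-based estimate. The abstract does not name the statistical fault detector either, so a Wald sequential probability ratio test (SPRT) for a Gaussian mean shift is assumed. The synthetic telemetry, the function names, and all parameter values (`leak_rate`, `shift`, `alpha`, `beta`) are hypothetical.

```python
import math
import statistics


def synthetic_memory_mb(n, leak_start=None, leak_rate=0.05):
    """Hypothetical telemetry: steady usage plus a bounded wobble,
    with an optional linear leak starting at sample `leak_start`."""
    series = []
    for i in range(n):
        value = 100.0 + 0.5 * math.sin(i)
        if leak_start is not None and i >= leak_start:
            value += leak_rate * (i - leak_start)
        series.append(value)
    return series


def sprt_first_alarm(residuals, sigma, shift, alpha=0.01, beta=0.01):
    """Wald SPRT for H0: mean 0 versus H1: mean `shift`, Gaussian noise.
    Returns the index of the first alarm, or None if no alarm is raised."""
    upper = math.log((1 - beta) / alpha)   # crossing this accepts H1 (alarm)
    lower = math.log(beta / (1 - alpha))   # crossing this accepts H0 (restart)
    llr = 0.0
    for i, r in enumerate(residuals):
        # Log-likelihood ratio increment for a Gaussian mean shift.
        llr += (shift / sigma ** 2) * (r - shift / 2)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0  # healthy decision reached; start a fresh test
    return None


def detect_leak(training, monitored):
    """Assumed stand-in for MSET: estimate each sample as the training mean,
    then run the SPRT on the residuals (observed minus estimate)."""
    baseline = statistics.fmean(training)
    sigma = statistics.pstdev(training)
    residuals = [x - baseline for x in monitored]
    return sprt_first_alarm(residuals, sigma, shift=2.0 * sigma)


training = synthetic_memory_mb(200)
print("healthy run alarm index:", detect_leak(training, synthetic_memory_mb(200)))
print("leaky run alarm index:",
      detect_leak(training, synthetic_memory_mb(200, leak_start=100)))
```

A real deployment would replace the baseline mean with a multivariate estimate that exploits correlations among the sampled metrics, which is the role the abstract assigns to MSET.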