FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation

2019 
The failures of software service directly affect user experiences and service revenue. Thus operators monitor both service-level KPIs (e.g., response time) and machine-level KPIs (e.g., CPU usage) on each machine underlying the service. When a service fails, the operators must localize the root cause machines, and mitigate the failure as quickly as possible. Existing approaches have limited application due to the difficulty to obtain the required additional measurement data. As a result, failure localization is largely manual and very time-consuming. This paper presents FluxRank, a widely-deployable framework that can automatically and accurately localize the root cause machines, so that some actions can be triggered to mitigate the service failure. Our evaluation using historical cases from five real services (with tens of thousands of machines) of a top search company shows that the root cause machines are ranked top 1 (top 3) for 55 (66) cases out of 70 cases. Comparing to existing approaches, FluxRank cuts the localization time by more than 80% on average. FluxRank has been deployed online at one Internet service and six banking services for three months, and correctly localized the root cause machines as the top 1 for 55 cases out of 59 cases.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    68
    References
    4
    Citations
    NaN
    KQI
    []