Automatically and Adaptively Identifying Severe Alerts for Online Service Systems

2020 
In large-scale online service system, to enhance the quality of services, engineers need to collect various monitoring data and write many rules to trigger alerts. However, the number of alerts is way more than what on-call engineers can properly investigate. Thus, in practice, alerts are classified into several priority levels using manual rules, and on-call engineers primarily focus on handling the alerts with the highest priority level (i.e., severe alerts). Unfortunately, due to the complex and dynamic nature of the online services, this rule-based approach results in missed severe alerts or wasted troubleshooting time on non-severe alerts. In this paper, we propose AlertRank, an automatic and adaptive framework for identifying severe alerts. Specifically, AlertRank extracts a set of powerful and interpretable features (textual and temporal alert features, univariate and multivariate anomaly features for monitoring metrics), adopts XGBoost ranking algorithm to identify the severe alerts out of all incoming alerts, and uses novel methods to obtain labels for both training and testing. Experiments on the datasets from a top global commercial bank demonstrate that AlertRank is effective and achieves the F1-score of 0.89 on average, outperforming all baselines. The feedback from practice shows AlertRank can significantly save the manual efforts for on-call engineers.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    37
    References
    8
    Citations
    NaN
    KQI
    []