Towards Automatic Root Cause Diagnosis of Persistent Packet Loss in Cloud Overlay Network

2022 
Persistent packet loss in cloud-scale overlay networks severely compromises tenant experience, and cloud providers are keen to diagnose such problems efficiently. However, existing work is either designed for the physical network or unable to identify the concrete cause of packet loss. We propose to record and analyze the on-site forwarding condition of packets during packet-level tracing. The cloud-scale overlay network makes this goal challenging because of its high network complexity, multi-tenant nature, and diversity of root causes. To address these challenges, we present VTrace, an automatic diagnostic system for persistent packet loss in the cloud-scale overlay network. Utilizing the “fast path-slow path” structure of virtual forwarding devices (VFDs), e.g., vSwitches, VTrace installs several “coloring-matching-logging” rules in VFDs to selectively track target packets and inspect them in depth. The detailed forwarding situation at each hop is logged, and the logs are then assembled for analysis using an efficient path reconstruction scheme. Experiments demonstrate VTrace’s low overhead and quick response. In addition, based on the idea of “coloring-matching-counting”, VTrace can be easily extended to VTrace-stats to identify the culprit device for transient packet loss. We share our experience of how VTrace and VTrace-stats have worked efficiently after years of deployment in Alibaba Cloud.
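
The “coloring-matching-logging” idea and the per-hop path reconstruction can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not VTrace's actual implementation: the `Packet`, `VFD`, `process`, `trace`, and `reconstruct` names, the hop counter carried with the color mark, and the `drop_reason` field are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    flow: tuple            # hypothetical flow key, e.g. (src_ip, dst_ip)
    colored: bool = False  # assumed mark set at the first hop for target flows
    hop: int = 0           # assumed hop counter carried along with the mark


@dataclass
class VFD:
    name: str
    next_hop: Optional[str]            # None means the packet is delivered locally
    drop_reason: Optional[str] = None  # e.g. "acl_deny"; None means forward


def process(dev, pkt, target_flow, is_ingress, logs):
    """Apply coloring, matching, and logging at one device; return the next hop."""
    # Coloring: only the first device marks packets of the flow under diagnosis.
    if is_ingress and pkt.flow == target_flow:
        pkt.colored = True
    # Matching + logging: only colored packets are inspected and recorded.
    if pkt.colored:
        pkt.hop += 1
        logs.append({
            "flow": pkt.flow,
            "hop": pkt.hop,
            "device": dev.name,
            "action": "drop" if dev.drop_reason else "forward",
            "reason": dev.drop_reason,
        })
    return None if dev.drop_reason else dev.next_hop


def trace(topology, first, pkt, target_flow, logs):
    """Walk the packet through the (acyclic) toy topology starting at `first`."""
    current, is_ingress = first, True
    while current is not None:
        current = process(topology[current], pkt, target_flow, is_ingress, logs)
        is_ingress = False


def reconstruct(logs, flow):
    """Order the per-hop logs of one flow and report the first dropping hop."""
    records = sorted((r for r in logs if r["flow"] == flow), key=lambda r: r["hop"])
    path = [r["device"] for r in records]
    culprit = next((r for r in records if r["action"] == "drop"), None)
    return path, culprit


if __name__ == "__main__":
    topology = {
        "vswitch-a": VFD("vswitch-a", next_hop="gateway"),
        "gateway": VFD("gateway", next_hop="vswitch-b", drop_reason="acl_deny"),
        "vswitch-b": VFD("vswitch-b", next_hop=None),
    }
    logs = []
    target = ("10.0.0.1", "10.0.0.2")
    trace(topology, "vswitch-a", Packet(flow=target), target, logs)
    print(reconstruct(logs, target))
```

In this sketch, packets of other flows are never colored, so they produce no log records; that selectivity is what keeps the tracing overhead confined to the flow being diagnosed.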