Diagnosing missing events in distributed systems with negative provenance

2015 
When debugging a distributed system, it is sometimes necessary to explain the absence of an event - for instance, why a certain route is not available, or why a certain packet did not arrive. Existing debuggers offer some support for explaining the presence of events, usually by providing the equivalent of a backtrace in conventional debuggers, but they are not very good at answering 'Why not?' questions: there is simply no starting point for a possible backtrace. In this paper, we show that the concept of negative provenance can be used to explain the absence of events in distributed systems. Negative provenance relies on counterfactual reasoning to identify the conditions under which the missing event could have occurred. We define a formal model of negative provenance for distributed systems, and we present the design of a system called Y! that tracks both positive and negative provenance and can use them to answer diagnostic queries. We describe how we have used Y! to debug several realistic problems in two application domains: software-defined networks and BGP interdomain routing. Results from our experimental evaluation show that the overhead of Y! is moderate.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    30
    References
    48
    Citations
    NaN
    KQI
    []