Understanding Congestion in High Performance Interconnection Networks Using Sampling

2018 
To improve the communication performance of an application executing on a cluster or supercomputer, developers need tools that enable them to understand how the application's communication patterns interact with the system's network, especially when those interactions result in congestion. Since communication performance is difficult to reason about analytically and simulation is costly, measurement-based approaches are needed. This paper describes a new sampling-based technique to collect information about the path a packet takes and the congestion it encounters. We describe a variant of this scheme that requires only 5 to 6 bits of information in a monitored packet, making it practical for use in next-generation networks. Network simulations using communication traces for miniGhost (a synthetic 3D finite difference mini-application) and pF3D (a code that simulates laser-plasma interactions) show that our technique provides precise application-centric quantitative information about traffic and congestion that can be used to distinguish between problems with an application's communication patterns, its mapping onto a parallel system, and outside interference.