Full-system critical-path analysis and performance prediction

2009 
Many important workloads today, such as web-hosted services, are limited not by processor core performance but by interactions among the cores, the memory system, I/O devices, and the complex software layers that tie these components together. Architects who optimize system designs for these workloads must identify performance bottlenecks before the systems are built. This identification is difficult because, as in any concurrent system, overheads in one component may be hidden by overlap with other operations. These overlaps span the user/kernel and software/hardware boundaries, making traditional tools inadequate. Common software profiling techniques cannot account for hardware bottlenecks or for situations in which software overheads are hidden by overlapping hardware operations. Critical-path analysis is a powerful approach for identifying bottlenecks in highly concurrent systems, but it typically requires detailed domain knowledge to construct the required event dependence graphs. As a result, to date it has been applied only to isolated system layers (e.g., processor microarchitectures or message-passing applications).

This thesis presents a methodology for identifying true end-to-end critical paths in systems composed of multiple layers of hardware and software, particularly in the domain of high-speed networking. The state machines that implicitly or explicitly govern the behavior of all the layers are modeled, and their local interactions are captured to build an end-to-end dependence graph that can be used to locate bottlenecks. This is done incrementally, with modest effort and only local understanding. Furthermore, it is shown that queue-based interactions are necessary and sufficient to capture information from complex protocols, multiple connections, and multiple processors. The resulting dependence graph is created and analyzed, distilling the large amount of collected data into a set of bottleneck locations, including the points where the most un-overlapped time is spent and the places where adding buffering could improve the system's performance without any other optimizations. In addition, the technique provides accurate quantitative predictions of the benefit of eliminating bottlenecks.

The end result of this analysis, available minutes after the data is gathered, is: (1) the identity of the component that causes the bottleneck; (2) the extent to which that component must be improved before it is no longer the bottleneck; (3) the next bottleneck that will be exposed in the system; and (4) the performance improvement that will occur before the next bottleneck is reached. The analysis can be repeated for successive bottlenecks and is far faster than the available alternatives, which include studying large volumes of statistics in an attempt to understand the problem, developing and prototyping a solution, and simulating that solution. The traditional approach can take days or weeks of implementation and simulation, and it provides no certainty that the correct bottleneck has been found until all the work to mitigate it is complete.
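To make the approach concrete, the following is a minimal, self-contained sketch (in Python) of the core computation the abstract describes: walking the longest latency-weighted path through an event dependence graph and charging each component for its un-overlapped time. The event names, durations, component labels, and the critical_path helper are illustrative assumptions, not material from the thesis; the thesis builds its graph automatically from the state machines and queue-based interactions of real hardware and software layers rather than from a hand-written table like this.

```python
from collections import defaultdict

# Hypothetical events: name -> (component, duration).  Edges in `deps`
# list, for each event, the earlier events it must wait for.
events = {
    "app_send":    ("application", 3.0),
    "kernel_copy": ("kernel",      5.0),
    "nic_dma":     ("nic",         9.0),
    "nic_tx":      ("nic",         4.0),
    "app_next":    ("application", 2.0),  # overlaps with the NIC work
}
deps = {
    "kernel_copy": ["app_send"],
    "nic_dma":     ["kernel_copy"],
    "nic_tx":      ["nic_dma"],
    "app_next":    ["app_send"],
}

def critical_path(events, deps):
    """Longest (latency-weighted) path through the dependence DAG,
    plus the un-overlapped time charged to each component."""
    finish, via = {}, {}

    def visit(e):
        if e in finish:
            return finish[e]
        preds = deps.get(e, [])
        start = max((visit(p) for p in preds), default=0.0)
        via[e] = max(preds, key=lambda p: finish[p], default=None)
        finish[e] = start + events[e][1]
        return finish[e]

    last = max(events, key=visit)          # event with the latest finish
    path, charge = [], defaultdict(float)
    e = last
    while e is not None:                   # walk back along the critical path
        path.append(e)
        charge[events[e][0]] += events[e][1]
        e = via[e]
    return list(reversed(path)), dict(charge), finish[last]

path, charge, total = critical_path(events, deps)
print("critical path :", " -> ".join(path))
print("total latency :", total)
print("un-overlapped time per component:", charge)
# app_next never appears in the charge: its cost is hidden by overlap,
# exactly the kind of time a flat profiler would mis-attribute.
```

A naive what-if experiment follows the same pattern: shrink the durations charged to the dominant component, re-run the analysis, and the new longest path identifies the next bottleneck and bounds the speedup that can actually be realized before it is exposed.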