A Computational Stack for Cross-Domain Acceleration

2021 
Domain-specific accelerators obtain performance benefits by restricting their algorithmic domain. These accelerators utilize specialized languages constrained to particular hardware, trading off expressiveness for high performance. The pendulum has swung from one hardware (general-purpose processors) for all domains to the opposite end, i.e., one hardware per individual domain. The middle ground on this spectrum, which provides a unified computational stack across multiple, but not all, domains, is an emerging and open research challenge. This paper sets out to explore this region and its associated tradeoff between expressiveness and performance by defining a cross-domain stack, dubbed PolyMath. This stack defines a high-level cross-domain language (CDL), called PMLang, that in a modular and reusable manner encapsulates mathematical properties to be expressive across multiple domains: Robotics, Graph Analytics, Digital Signal Processing, Deep Learning, and Data Analytics. PMLang is backed by a recursively-defined intermediate representation, dubbed srDFG, that allows simultaneous access to all levels of operation granularity. Accelerator-specific or domain-specific IRs commonly capture operations at the granularity that best fits a set of Domain-Specific Architectures (DSAs). In contrast, the recursive nature of our srDFG IR enables simultaneous access to all granularities of computation for every operation, thus forming an ideal bridge for converting to various DSA-specific IRs across multiple domains. Consequently, our stack unlocks multi-acceleration for end-to-end applications that cross the boundaries of multiple domains, each comprising different data and compute patterns. Experimental evaluations show that by using PolyMath it is possible to harness accelerators across the five domains to realize an average speedup of 3.3× over a Xeon CPU along with an 18.1× reduction in energy. In comparison to Jetson Xavier and Titan Xp, cross-domain acceleration offers 1.7× and 7.2× improvement in performance-per-watt, respectively. We measure the cross-domain expressiveness-performance tradeoff by comparing each benchmark against its hand-optimized implementation: PolyMath achieves 83.9% and 76.8% of the optimal performance for single-domain algorithms and end-to-end applications, respectively. For the two case studies of end-to-end applications (comprising algorithms from multiple domains), results show that accelerating all the kernels offers an additional 2.0× speedup over the CPU, 6.1× improvement in performance-per-watt over Titan Xp, and 2.8× speedup over Jetson Xavier, compared to accelerating only the single most effective single-domain kernel. Finally, we examine the utility and expressiveness of PolyMath through a user study, which shows that, on average, PolyMath requires 1.9× less time to implement algorithms from two different domains, with 2.5× fewer lines of code relative to Python.
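To make the idea of a recursively-defined IR concrete, the following is a minimal illustrative sketch, not the actual PolyMath implementation. It assumes hypothetical names (`Node`, `flatten`) and shows how a dataflow-graph node that is either a primitive operation or a subgraph lets a traversal stop at any granularity, which is the property the abstract attributes to srDFG.

```python
# Hedged sketch of a recursively-defined dataflow-graph node.
# Each Node is either a primitive op (no children) or a subgraph,
# so one structure exposes every granularity of computation.

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []  # empty list => primitive op

    def is_primitive(self):
        return not self.children

    def flatten(self, max_depth=None, depth=0):
        """Yield operations at the requested granularity.

        max_depth=0 keeps the coarsest view; max_depth=None fully
        lowers the graph to primitive operations.
        """
        if self.is_primitive() or (max_depth is not None and depth >= max_depth):
            yield self.name
        else:
            for child in self.children:
                yield from child.flatten(max_depth, depth + 1)

# A matrix-multiply node that lowers into multiply/add primitives.
matmul = Node("matmul", [Node("mul"), Node("add")])
graph = Node("gemm_layer", [matmul, Node("relu")])

print(list(graph.flatten(max_depth=1)))  # coarse view: ['matmul', 'relu']
print(list(graph.flatten()))             # fine view:   ['mul', 'add', 'relu']
```

A DSA-specific backend would pick the `max_depth` whose operations best match its own IR, which is how a single recursive structure can bridge to accelerators with very different operation granularities.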