CRAT: Enabling Coordinated Register Allocation and Thread-Level Parallelism Optimization for GPUs

2018 
The key to high performance on GPUs lies in massive threading, which enables thread switching and hides long latencies. GPUs are equipped with a large register file to enable fast context switching. However, thread throttling techniques that are designed to mitigate cache contention lead to under-utilization of registers. Register allocation is a significant factor for performance because it not only determines single-thread performance but also indirectly affects thread-level parallelism (TLP). In this paper, we propose Coordinated Register Allocation and Thread-level parallelism (CRAT) to explore the optimization space of register allocation and TLP management on GPUs. CRAT employs both compile-time (CRAT-static) and run-time (CRAT-dyn) techniques to exhaust the design space. CRAT-static works statically to explore the trade-off between TLP and register allocation, and CRAT-dyn exploits dynamic register allocation for further improvement. Experiments indicate that CRAT-static achieves an average 1.25X speedup over an existing TLP management technique. On four register-limited applications, CRAT-dyn further improves on CRAT-static, raising the speedup from 1.51X to 1.70X.
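The abstract's claim that register allocation indirectly affects TLP can be illustrated with a small occupancy calculation. The following is a minimal sketch, not the paper's method: the hardware limits (`REGISTERS_PER_SM`, `MAX_THREADS_PER_SM`) and the helper `resident_threads` are hypothetical values and names chosen for illustration, and the model ignores allocation granularity, shared memory, and other occupancy limits.

```python
# Illustrative only: the limits below are assumed values, not taken from the paper.
REGISTERS_PER_SM = 65536    # assumed number of registers per streaming multiprocessor (SM)
MAX_THREADS_PER_SM = 2048   # assumed hardware cap on resident threads per SM


def resident_threads(regs_per_thread: int, threads_per_block: int) -> int:
    """Return how many threads can be resident on one SM under the assumed limits.

    Whole thread blocks are scheduled, so the count is limited by whichever
    resource (registers or the thread cap) admits fewer blocks.
    """
    regs_per_block = regs_per_thread * threads_per_block
    blocks_by_regs = REGISTERS_PER_SM // regs_per_block
    blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block
    return min(blocks_by_regs, blocks_by_threads) * threads_per_block


if __name__ == "__main__":
    # More registers per thread means fewer resident threads (lower TLP).
    for regs in (32, 64, 128):
        print(regs, resident_threads(regs, threads_per_block=256))
```

Under these assumed limits, 32 registers per thread allow 2048 resident threads, 64 allow 1024, and 128 allow 512. Conversely, when thread throttling deliberately lowers the number of resident threads, the register budget per thread grows, which is the kind of trade-off the abstract describes.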