Implementation and Evaluation of CUDA-Unified Memory in Numba

2020 
Python as a programming language is increasingly gaining importance, especially in data science, scientific computing, and parallel programming. With Numba-CUDA, it is even possible to program GPUs with Python using a CUDA-like programming style. However, Numba lacks support for CUDA unified memory, which can simplify programming even further and allows dynamic work distribution between GPUs and CPUs. In this work, we implement and evaluate support for unified memory in Numba. As expected, the performance of unified memory is worse than that of explicit data transfers, but it can outperform the implicit transfer methods provided by Numba. Additionally, using unified memory can reduce the Python interpreter overhead and therefore improve performance for small problem sizes. The use of system-wide atomics can improve the work distribution between GPU and CPU, but with more CPU threads the performance suffers under the Python global interpreter lock (GIL).
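To illustrate the data-movement styles the abstract contrasts, the following is a minimal sketch of a Numba CUDA kernel launched with explicit transfers (cuda.to_device / copy_to_host) and with Numba's implicit transfers (passing a NumPy array directly). The final, unified-memory variant is an assumption about what a managed allocation interface looks like (shown here as cuda.managed_array), not necessarily the exact API used in the paper.

    import numpy as np
    from numba import cuda

    @cuda.jit
    def add_one(arr):
        # One thread per element; guard against excess threads.
        i = cuda.grid(1)
        if i < arr.shape[0]:
            arr[i] += 1.0

    n = 1 << 20
    threads = 256
    blocks = (n + threads - 1) // threads
    host = np.zeros(n, dtype=np.float32)

    # Explicit transfers: copy to the device, launch, copy back.
    dev = cuda.to_device(host)
    add_one[blocks, threads](dev)
    dev.copy_to_host(host)

    # Implicit transfers: Numba copies the NumPy array to the device
    # before the launch and back afterwards, adding per-call overhead.
    add_one[blocks, threads](host)

    # Unified (managed) memory: a single allocation visible to both
    # CPU and GPU, so no explicit copies are needed (assumed interface).
    managed = cuda.managed_array(n, dtype=np.float32)
    add_one[blocks, threads](managed)
    cuda.synchronize()  # same buffer is now consistent for host access

The point of the comparison is that explicit transfers give the programmer full control and the best performance, implicit transfers trade performance for convenience, and unified memory aims to keep the convenience while letting the driver migrate pages on demand between host and device.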