A Library for Portable and Composable Data Locality Optimizations for NUMA Systems

2017 
Many recent multiprocessor systems are realized with a nonuniform memory architecture (NUMA) and accesses to remote memory locations take more time than local memory accesses. Optimizing NUMA memory system performance is difficult and costly for three principal reasons: (1) Today’s programming languages/libraries have no explicit support for NUMA systems, (2) NUMA optimizations are not portable, and (3) optimizations are not composable (i.e., they can become ineffective or worsen performance in environments that support composable parallel software). This article presents TBB-NUMA, a parallel programming library based on Intel Threading Building Blocks (TBB) that supports portable and composable NUMA-aware programming. TBB-NUMA provides a model of task affinity that captures a programmer’s insights on mapping tasks to resources. NUMA-awareness affects all layers of the library (i.e., resource management, task scheduling, and high-level parallel algorithm templates) and requires close coupling between all these layers. Optimizations implemented with TBB-NUMA (for a set of standard benchmark programs) result in up to 44% performance improvement over standard TBB. But more important, optimized programs are portable across different NUMA architectures and preserve data locality also when composed with other parallel computations sharing the same resource management layer.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    39
    References
    10
    Citations
    NaN
    KQI
    []