rmalloc() and rpipe(): a uGNI-based Distributed Remote Memory Allocator and Access Library for One-sided Messaging

2018 
Optimizing communication is essential for high-performance computing because synchronization bottlenecks inhibit the overall performance and scalability of parallel applications. Today's cutting-edge computing hardware, as well as networking interfaces like Cray Aries/Gemini, features extremely low-latency, high-bandwidth remote memory access (RMA) operations for optimized data movement. However, for efficient data movement to occur between two logical processing units, software substrates must properly exploit the hardware resources of the underlying fabric. Overheads due to coarse-grained synchronization and stalls during irregular access to remote memory regions point to two adverse effects: under-utilization of resources in time and in space. We introduce a uGNI-based distributed remote memory allocator called "rmalloc", which expands RDMA-enabled memory utilization, and a communication substrate called "rpipe", which tries to mitigate synchronization bottlenecks. Our UNIX-inspired RMA programming model is simple to use and applies equally to higher-level applications and lower-level runtime systems for enabling efficient data movement. Our micro-benchmark results suggest that the default next-fit "rmalloc" allocator outperforms MPI-3.0 RMA by 1.5X and up to 6X in most cases, while other variants of "rmalloc" (i.e., best-fit and worst-fit) reduce external fragmentation and perform comparably to or better than the default "rmalloc" allocator for irregular RMA.
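The abstract names three placement policies (next-fit, best-fit, worst-fit) without describing them. As a rough illustration of how these policies differ when choosing a free region, here is a minimal sketch in Python. It is not the "rmalloc" implementation: the free-list representation and the names `Hole`, `next_fit`, `best_fit`, `worst_fit`, and `allocate` are assumptions made for this example, and nothing here touches uGNI or registered memory.

```python
from typing import List, Optional, Tuple

# Hypothetical free-list entry: (offset, size) of a free region, in bytes.
Hole = Tuple[int, int]


def next_fit(holes: List[Hole], size: int, start: int) -> Optional[int]:
    """Index of the first hole that fits, scanning circularly from `start`."""
    n = len(holes)
    for step in range(n):
        i = (start + step) % n
        if holes[i][1] >= size:
            return i
    return None


def best_fit(holes: List[Hole], size: int) -> Optional[int]:
    """Index of the smallest hole that still fits the request."""
    fits = [i for i, (_, s) in enumerate(holes) if s >= size]
    return min(fits, key=lambda i: holes[i][1]) if fits else None


def worst_fit(holes: List[Hole], size: int) -> Optional[int]:
    """Index of the largest hole that fits the request."""
    fits = [i for i, (_, s) in enumerate(holes) if s >= size]
    return max(fits, key=lambda i: holes[i][1]) if fits else None


def allocate(holes: List[Hole], index: int, size: int) -> int:
    """Carve `size` bytes from the front of holes[index]; return the offset."""
    offset, hole_size = holes[index]
    if hole_size == size:
        del holes[index]  # hole consumed exactly
    else:
        holes[index] = (offset + size, hole_size - size)  # shrink the hole
    return offset


# Illustrative free list; the numbers are made up for the example.
holes: List[Hole] = [(0, 64), (128, 256), (512, 96), (1024, 32)]
```

For an 80-byte request on this free list, best-fit picks the 96-byte hole and leaves a small 16-byte remainder, worst-fit picks the 256-byte hole and leaves a large remainder, and next-fit simply takes the first fitting hole after its last scan position, which keeps the search short.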