A communication-avoiding 3D algorithm for sparse LU factorization on heterogeneous systems

2019 
Abstract We propose a new algorithm to improve the strong scalability of right-looking sparse LU factorization on distributed memory systems. Our 3D algorithm for sparse LU uses a three-dimensional MPI process grid, exploits elimination tree parallelism, and trades off increased memory for reduced per-process communication. We also analyze the asymptotic improvements for planar graphs (e.g., those arising from 2D grid or mesh discretizations) and certain non-planar graphs (specifically for 3D grids and meshes). For a planar graph with n vertices, our algorithm reduces communication volume asymptotically in n by a factor of O log n and latency by a factor of O log n . For non-planar cases, our algorithm can reduce the per-process communication volume by 3 × and latency by O n 1 3 times. In all cases, the memory needed to achieve these gains is a constant factor. We implemented our algorithm by extending the 2D data structure used in SuperLU_DIST . Our new 3D code achieves empirical speedups up to 27 × for planar graphs and up to 3.3 × for non-planar graphs over the baseline 2D SuperLU_DIST  when run on 24,000 cores of a Cray XC30. We extend the 3D algorithm for heterogeneous architectures by adding the Highly Asynchronous Lazy Offload ( Halo ) algorithm for co-processor offload [44] . On 4096 nodes of a Cray XK7 with 32,768 CPU cores and 4096 Nvidia K20x GPUs, the 3D algorithm achieves empirical speedups up to 24 × for planar graphs and 3.5 × for non-planar graphs over the baseline 2D SuperLU_DIST with co-processor acceleration.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    47
    References
    4
    Citations
    NaN
    KQI
    []