High-performance dense tucker decomposition on GPU clusters

2018 
The dense Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationship. Its execution time is typically dominated by dense matrix multiplication operations, which makes it well-suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions that result in under-utilizing the GPU resources. In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance for GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variation of the randomized SVD algorithm to overcome the eigenvalue calculation bottleneck that arises from the high speedup gained from GPU acceleration. When compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations described above, achieves up to 11.8X speedup on 64 nodes. Our best CPU implementation, which also employs all three optimizations, achieves up to 3.6X speedup over TuckerMPI on 64 nodes. When we compare our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1X to 3. 6X on a single node, and from 1.8X to 3.3X on 64 nodes, depending on the input data set.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    23
    References
    13
    Citations
    NaN
    KQI
    []