High-performance dense tucker decomposition on GPU clusters

Jee W. Choi,Xing Liu,Venkatesan T. Chakaravarthy

High-performance dense tucker decomposition on GPU clusters

2018

The dense Tucker decomposition method is one of the most popular algorithms for analyzing and compressing data with multi-way relationship. Its execution time is typically dominated by dense matrix multiplication operations, which makes it well-suited for GPU acceleration. State-of-the-art distributed dense Tucker implementations for CPU clusters adopt multi-dimensional partitioning that optimizes for storage and communication. This, however, leads to smaller matrix dimensions that result in under-utilizing the GPU resources. In this paper, we present our optimized implementation and performance analysis of dense Tucker decomposition on a multi-GPU cluster. We propose three key optimizations: a new partitioning strategy that improves performance for GPUs, a new tensor matricization layout that halves the number of communication and matricization steps, and a variation of the randomized SVD algorithm to overcome the eigenvalue calculation bottleneck that arises from the high speedup gained from GPU acceleration. When compared to the state-of-the-art TuckerMPI library, our best GPU implementation, which employs all three optimizations described above, achieves up to 11.8X speedup on 64 nodes. Our best CPU implementation, which also employs all three optimizations, achieves up to 3.6X speedup over TuckerMPI on 64 nodes. When we compare our best GPU implementation to our best CPU implementation, the speedup ranges from 2.1X to 3. 6X on a single node, and from 1.8X to 3.3X on 64 nodes, depending on the input data set.

Keywords:

Correction
Source
Cite
Save
Machine Reading By IdeaReader

References

Citations