Autotuning Tensor Transposition

Lai Wei, John M. Mellor-Crummey

2014

Tensor transposition, a generalization of matrix transposition, is an important primitive used when performing tensor contraction. Efficient implementation of tensor transposition for modern node architectures depends on various architecture capabilities such as cache and memory hierarchy, threads, and SIMD parallelism. This paper introduces a framework that uses static analysis and empirical autotuning to produce optimized parallel tensor transposition code for node architectures using a rule-based code generation and transformation system. By exploring various optimization techniques with different settings, our framework achieves more than 80% of the bandwidth of memcpy for tensors on two very different node architectures, one a dual-socket system with Intel Westmere processors and the other a quad-socket system with IBM Power7 processors.
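To make the operation concrete, the following is a minimal reference sketch of tensor transposition on a dense, row-major tensor stored as a flat list. It is not the paper's generated code and applies none of the optimizations the abstract mentions (cache blocking, threading, SIMD); the names `row_major_strides` and `transpose_tensor`, and the convention that output dimension `k` takes input dimension `perm[k]`, are assumptions chosen for illustration.

```python
from itertools import product


def row_major_strides(shape):
    """Element strides of a dense row-major tensor with the given shape."""
    strides = [1] * len(shape)
    for i in range(len(shape) - 2, -1, -1):
        strides[i] = strides[i + 1] * shape[i + 1]
    return strides


def transpose_tensor(src, shape, perm):
    """Permute the dimensions of a flat row-major tensor.

    Assumed convention (illustrative): output dimension k is input
    dimension perm[k], so out[i_0, ..., i_{n-1}] reads the input element
    whose index along dimension perm[k] is i_k.
    """
    if sorted(perm) != list(range(len(shape))):
        raise ValueError("perm must be a permutation of the dimension indices")

    in_strides = row_major_strides(shape)
    out_shape = [shape[p] for p in perm]
    # Input stride to apply for each output dimension.
    perm_strides = [in_strides[p] for p in perm]

    dst = [None] * len(src)
    # product() walks output indices in row-major order, so the running
    # counter is exactly the flat output position.
    for out_pos, out_idx in enumerate(product(*(range(n) for n in out_shape))):
        in_pos = sum(i * s for i, s in zip(out_idx, perm_strides))
        dst[out_pos] = src[in_pos]
    return dst, out_shape


# Rank 2 with perm (1, 0) is ordinary matrix transposition of a 2x3 matrix.
matrix_t, matrix_t_shape = transpose_tensor([0, 1, 2, 3, 4, 5], (2, 3), (1, 0))
print(matrix_t, matrix_t_shape)  # [0, 3, 1, 4, 2, 5] [3, 2]
```

The output is written contiguously while the input is read with permuted strides, so one side of the copy is generally non-unit-stride. That access pattern is what makes a straightforward loop nest like this one slow relative to memcpy, and it is the kind of cost that tuned implementations try to reduce.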

Keywords:

Transposition (music)
Computer science
Parallel computing
Thread (computing)
Tensor
Architecture
Memory hierarchy
In-place matrix transposition
Code generation
Tensor contraction
Theoretical computer science
SIMD
Transpose
Cache
