High-Performance Tensor Contraction without Transposition

2018 
Tensor computations---in particular tensor contraction (TC)---are important kernels in many scientific computing applications. Due to the fundamental similarity of TC to matrix multiplication, and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLAS-like Library Instantiation Software (BLIS) framework, which allows transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of matrix multiplication, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for matrix multiplication in BLIS, with similar performance characteristics. The complexity...
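The overhead the abstract refers to can be seen in a small sketch. Below, a hypothetical contraction C[a,b,c] = sum_k A[a,k,b] * B[k,c] is computed the traditional way (explicit transposition and reshape into a plain matrix multiplication, then reshape back) and directly via a native contraction call; the shapes, index names, and NumPy usage are illustrative assumptions, not TBLIS's actual API.

```python
import numpy as np

# Illustrative tensors: A carries indices (a, k, b), B carries (k, c).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 6))   # a=4, k=5, b=6
B = rng.standard_normal((5, 7))      # k=5, c=7

# Traditional approach: explicitly transpose (reshape) A so the
# contraction becomes an ordinary GEMM, then reshape the result back.
# The transpose costs extra memory traffic and workspace -- the
# overhead TBLIS avoids by fusing it into its packing step.
A_mat = A.transpose(0, 2, 1).reshape(4 * 6, 5)   # (a*b, k)
C_mat = A_mat @ B                                 # (a*b, c) via GEMM
C_ttgt = C_mat.reshape(4, 6, 7)                   # back to (a, b, c)

# Direct contraction, analogous in spirit to a native TC kernel
# that never materializes the transposed operand.
C_direct = np.einsum('akb,kc->abc', A, B)

assert np.allclose(C_ttgt, C_direct)
```

Both paths produce the same result; the difference lies in the extra O(n) data movement and workspace consumed by the explicit transposition in the first path.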