SIMD Parallel Sparse Matrix-Vector and Transposed-Matrix-Vector Multiplication in DD Precision

Toshiaki Hishinuma, Hidehiko Hasegawa, Teruo Tanaka

2016

We accelerate the multiplication of a double-precision sparse matrix by a double-double (DD) precision vector (DD-SpMV), and of its transpose by a DD vector (DD-TSpMV), using AVX2 SIMD instructions. AVX2 requires changing the memory access pattern so that four consecutive 64-bit elements can be read at once. In our previous research, DD-SpMV in the CRS format using AVX2 needed non-contiguous memory loads, processing of the remainder, and a summation of the four elements in the AVX2 register. These factors degrade the performance of DD-SpMV. In this paper, we compare storage formats for DD-SpMV and DD-TSpMV with AVX2 in order to eliminate these performance degradation factors of CRS. Our results indicate that BCRS4x1, whose block size fits the length of the AVX2 register, is effective for both DD-SpMV and DD-TSpMV.
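To make the storage-format argument concrete, the following is a minimal scalar sketch in Python, not the authors' AVX2 implementation. It assumes that BCRS4x1 means blocks of 4 rows by 1 column with zero padding, and that a DD value is held as a (hi, lo) pair of doubles. All function and array names (`to_bcrs4x1`, `dd_spmv_bcrs4x1`, `bptr`, `bcol`, `bval`) are hypothetical, and the DD update uses standard error-free transformations in a simplified form. The inner `lane` loop stands in for the four 64-bit lanes of one AVX2 register.

```python
def two_sum(a, b):
    """Error-free sum: s = fl(a + b) and a + b = s + e exactly."""
    s = a + b
    v = s - a
    e = (a - (s - v)) + (b - v)
    return s, e

def split(a):
    """Dekker split of a double into high and low halves."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free product without FMA: p = fl(a * b) and a * b = p + e."""
    p = a * b
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)
    e = ((a_hi * b_hi - p) + a_hi * b_lo + a_lo * b_hi) + a_lo * b_lo
    return p, e

def dd_mul_add(y_hi, y_lo, a, x_hi, x_lo):
    """Return the DD value (y_hi, y_lo) + a * (x_hi, x_lo); a is a plain double."""
    p, e = two_prod(a, x_hi)
    e += a * x_lo
    s, t = two_sum(y_hi, p)
    t += y_lo + e
    hi = s + t            # renormalise (fast two-sum, assumes |s| >= |t|)
    lo = t - (hi - s)
    return hi, lo

def to_bcrs4x1(dense):
    """Convert a dense list-of-rows matrix to 4x1 block storage (assumed layout)."""
    n_rows, n_cols = len(dense), len(dense[0])
    bptr, bcol, bval = [0], [], []
    for r0 in range(0, n_rows, 4):
        for j in range(n_cols):
            # Rows past the end of the matrix are padded with zeros.
            block = [dense[r][j] if r < n_rows else 0.0 for r in range(r0, r0 + 4)]
            if any(v != 0.0 for v in block):
                bcol.append(j)
                bval.extend(block)  # four contiguous values per block
        bptr.append(len(bcol))
    return bptr, bcol, bval

def dd_spmv_bcrs4x1(n_rows, bptr, bcol, bval, x_hi, x_lo):
    """y = A x with A in double precision and x, y in DD precision."""
    n_brows = len(bptr) - 1
    y_hi = [0.0] * (4 * n_brows)
    y_lo = [0.0] * (4 * n_brows)
    for ib in range(n_brows):
        for k in range(bptr[ib], bptr[ib + 1]):
            j = bcol[k]  # one x element, shared by all four lanes
            for lane in range(4):  # one lane per output row of the block
                r = 4 * ib + lane
                y_hi[r], y_lo[r] = dd_mul_add(
                    y_hi[r], y_lo[r], bval[4 * k + lane], x_hi[j], x_lo[j])
    return y_hi[:n_rows], y_lo[:n_rows]
```

In this layout each lane accumulates into a different element of y, the four values of a block sit next to each other in memory, and padding makes every block full. Under the stated assumptions, this is how a 4x1 block avoids the three CRS costs named above: the in-register summation, the non-contiguous loads of matrix values, and the remainder handling. The price is the storage and arithmetic spent on padded zeros.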

Keywords:

Theoretical computer science
Parallel computing
Block size
Multiplication
Sparse matrix
SIMD
Remainder
Transpose
Computer science
Memory access pattern
Memory load
Sparse matrix vector
