Auto-Tuning GEMV on Many-Core GPU
2012
GPUs provide powerful computing capability, especially for data-parallel algorithms. However, the complexity of the GPU system makes optimizing even a simple algorithm difficult, and different parallel algorithms or optimization methods on a GPU often lead to very different performance. The matrix-vector multiplication routine for general dense matrices (GEMV) is a building block for many scientific and engineering computations. We find that the implementations of GEMV in CUBLAS 4.0 and MAGMA are not efficient, especially for small matrices and fat matrices (matrices with a small number of rows and a large number of columns). In this paper, we propose two new algorithms to optimize GEMV on the Fermi GPU. Instead of using only one thread, we use a warp to compute each element of the vector y. We also propose a novel register blocking method to further accelerate GEMV on the GPU. The proposed optimization methods are evaluated comprehensively on matrices of different sizes. Experimental results show that the new methods achieve over 10x speedup for small square matrices and fat matrices compared to CUBLAS 4.0 and MAGMA, and that the new register blocking method also outperforms CUBLAS 4.0 and MAGMA for large square matrices. We also propose a performance-tuning framework for choosing an optimal GEMV algorithm for an arbitrary input matrix on the GPU.
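To illustrate the warp-per-element idea described in the abstract, the following is a minimal CUDA sketch of a GEMV kernel in which one warp accumulates the dot product for one row of a row-major matrix A. It is not the authors' implementation: the kernel name, memory layout, and the use of warp shuffles (available only on architectures after Fermi, which the paper targets; a Fermi version would reduce through shared memory instead) are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Sketch: y = A * x, with A stored row-major as an m x n matrix.
// Each warp is assigned one row; lanes accumulate strided partial
// sums and then reduce them with warp shuffles.
__global__ void gemv_warp_per_row(const float *A, const float *x,
                                  float *y, int m, int n)
{
    const int warpsPerBlock = blockDim.x / 32;   // blockDim.x assumed a multiple of 32
    const int lane = threadIdx.x % 32;
    const int row  = blockIdx.x * warpsPerBlock + threadIdx.x / 32;
    if (row >= m) return;                        // whole warp exits together

    // Lane k handles columns k, k+32, k+64, ...
    float sum = 0.0f;
    for (int col = lane; col < n; col += 32)
        sum += A[(size_t)row * n + col] * x[col];

    // Warp-level tree reduction of the 32 partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);

    if (lane == 0)
        y[row] = sum;
}
```

With, say, 128 threads per block this gives four warps (four rows) per block, so a launch such as `gemv_warp_per_row<<<(m + 3) / 4, 128>>>(A, x, y, m, n)` covers all rows; the point of the design is that a full warp, rather than a single thread, produces each element of y, which keeps memory accesses coalesced even for fat matrices.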