Fast implementation of DGEMM on Fermi GPU

Guangming Tan, Linchuan Li, Sean Triechle, Everett H. Phillips, Yungang Bao, Ninghui Sun

2011

In this paper we present a detailed account of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by a performance model based on micro-architecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to that of the latest CUBLAS library. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance. That is, the achieved performance (efficiency relative to peak) improves from 302 Gflop/s (58%) to 362 Gflop/s (70%).
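The two-level blocking mentioned in the abstract can be illustrated with a minimal CPU-side sketch. This is not the authors' kernel: the function name `blocked_dgemm`, the tile sizes (`bm`, `bn`, `bk`, `rm`, `rn`), and the plain list-of-lists layout are all assumptions chosen for readability. The outer tile stands in for the panel a thread block would stage in shared memory, and the inner micro-tile stands in for the accumulators a single thread would hold in registers. Software pipelining, vector memory operations, and instruction scheduling are not modeled.

```python
def blocked_dgemm(A, B, C, bm=4, bn=4, bk=2, rm=2, rn=2):
    """Compute C += A @ B with two levels of blocking.

    A is M x K, B is K x N, C is M x N, all lists of lists of floats.
    All tile sizes are hypothetical and only illustrate the loop structure.
    """
    M, K, N = len(A), len(B), len(B[0])
    for i0 in range(0, M, bm):          # one C tile per "thread block"
        for j0 in range(0, N, bn):
            th = min(bm, M - i0)        # actual tile height at the edge
            tw = min(bn, N - j0)        # actual tile width at the edge
            # Accumulators for the whole tile; on a GPU each thread would
            # keep its own rm x rn piece of this in registers.
            acc = [[0.0] * tw for _ in range(th)]
            for k0 in range(0, K, bk):  # march along K one panel at a time
                # Stage the A and B panels (stand-in for shared memory).
                a_tile = [row[k0:k0 + bk] for row in A[i0:i0 + th]]
                b_tile = [row[j0:j0 + tw] for row in B[k0:k0 + bk]]
                for i1 in range(0, th, rm):      # one micro-tile per "thread"
                    for j1 in range(0, tw, rn):
                        for k in range(len(b_tile)):
                            # Load fragments once, reuse them rm * rn times.
                            a_frag = [a_tile[i][k]
                                      for i in range(i1, min(i1 + rm, th))]
                            b_frag = b_tile[k][j1:j1 + rn]
                            for di, a in enumerate(a_frag):
                                for dj, b in enumerate(b_frag):
                                    acc[i1 + di][j1 + dj] += a * b
            # Write the finished tile back to C once, after all K panels.
            for i in range(th):
                for j in range(tw):
                    C[i0 + i][j0 + j] += acc[i][j]
    return C
```

The point of the structure is data reuse: each staged panel element is read many times from the fast level before the next panel is fetched, and each fragment value feeds several multiply-adds, which is what lets a blocked kernel approach the arithmetic peak instead of being limited by memory bandwidth.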

Keywords:

Basic Linear Algebra Subprograms
Parallel computing
Memory management
Instruction scheduling
Architecture
Computer science
Computer architecture
Shared memory
Software pipelining
Memory hierarchy
Instruction set
CUDA
