A High-Performance Multi-array Accelerator for Large-Scale Floating-Point Matrix Multiplication

2019 
Large-scale matrix multiplication is a fundamental kernel in science and engineering applications. However, existing computing platforms such as CPUs, GPUs and FPGAs suffer from limited performance or excessive power consumption. This paper presents MMA, a high-performance and efficient accelerator for floating-point matrix multiplication based on scaled-out multi-array systolic arrays. A scheduling method is proposed for efficiently performing large-scale matrix multiplication. In addition, an analytical model is built from the related design parameters to explore the design space and determine the optimal design point. Evaluation results show that the accelerator, with 8x8 matrix processing arrays each containing an 8x8 systolic array, can achieve a maximum performance of 12 TFLOPS and an efficiency of 99% for large-scale matrix multiplication. Compared with the NVIDIA Tesla K40 GPU, implemented in a similar process, the actual performance of MMA is 2.57x that of the K40, while its area is only about 58.4% of the K40's.
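As a rough illustration of why large matrices must be scheduled in blocks on fixed-size arrays, the sketch below partitions the output into 8x8 blocks and accumulates partial products along the shared dimension. It is not the scheduling method proposed in the paper: the function names, the `PE` block size, and the loop order are assumptions chosen for illustration, and the block work is done in plain Python rather than on a systolic array.

```python
# Illustrative sketch only; not the paper's scheduling method.
# PE is an assumed block size matching an 8x8 systolic array.
PE = 8


def matmul_ref(a, b):
    """Naive reference product of two matrices given as lists of rows."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(a, b, block=PE):
    """Compute a @ b one output block at a time.

    Each (i0, j0) output block is the unit of work that a single
    block x block array could hold; partial products are accumulated
    over blocks of the shared dimension k.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                # Edge blocks are clipped when sizes are not multiples of block.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + block, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

In a multi-array design, independent output blocks could in principle be assigned to different arrays, since they share no partial sums; how MMA actually maps and sequences them is described in the paper, not here.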