Optimizing One by One Direct Convolution on ARMv8 Multi-core CPUs

2020 
Convolutional layers are ubiquitous in a variety of deep neural networks. Because of their lower computational complexity and smaller number of parameters, convolutions with small filter sizes, such as one-by-one convolution, are widely used. Nevertheless, these small convolution operations are still time-consuming. A common approach to implementing convolutions is to transform them into matrix multiplications, known as GEMM-based convolutions. This approach may incur additional memory overhead and relies on matrix multiplication routines that are not optimized for the matrices generated by convolutions. In this paper, we present a new parallel one-by-one direct convolution implementation for ARMv8 multi-core CPUs that requires no additional memory space. Our implementation is verified on two ARMv8 CPUs, Phytium FT-1500A and FT-2000plus. On Phytium FT-1500A, our implementation outperforms GEMM-based implementations in performance and scalability in all tests. On Phytium FT-2000plus, it delivers much better performance and scalability than GEMM-based approaches in most cases.
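To illustrate why a one-by-one convolution maps naturally onto a matrix multiplication and why a direct implementation needs no extra buffers (such as an im2col workspace), the sketch below gives a minimal scalar C loop nest. It is not the paper's implementation; the function name, NCHW-style layout, and parameter names are assumptions for illustration only.

```c
/* Minimal sketch (assumed layout, not the paper's code): for a 1x1 filter,
 * output[co][p] = sum over ci of filter[co][ci] * input[ci][p],
 * where p indexes the H*W spatial positions. This is exactly a GEMM of a
 * (C_out x C_in) filter matrix with a (C_in x H*W) input matrix, and a
 * direct loop nest computes it with no additional memory. */
#include <stddef.h>

void conv1x1_direct(const float *input,   /* C_in  x (H*W), row-major */
                    const float *filter,  /* C_out x C_in,  row-major */
                    float *output,        /* C_out x (H*W), row-major */
                    size_t c_in, size_t c_out, size_t hw)
{
    for (size_t co = 0; co < c_out; ++co) {
        float *out_row = &output[co * hw];
        for (size_t p = 0; p < hw; ++p)
            out_row[p] = 0.0f;                      /* initialize output row */
        for (size_t ci = 0; ci < c_in; ++ci) {
            const float w = filter[co * c_in + ci]; /* single 1x1 weight     */
            const float *in_row = &input[ci * hw];
            for (size_t p = 0; p < hw; ++p)
                out_row[p] += w * in_row[p];        /* accumulate channels   */
        }
    }
}
```

In an optimized version such as the one the paper targets, the spatial loop would be vectorized with ARMv8 NEON instructions and the channel loops blocked and parallelized across cores, but the memory footprint stays that of the inputs and outputs alone.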