Optimizing One by One Direct Convolution on ARMv8 Multi-core CPUs

2020 
Convolutional layers are ubiquitous in a variety of deep neural networks. Because of their lower computational complexity and smaller number of parameters, convolutions with small filter sizes, such as one-by-one convolution, are widely used. Nevertheless, these small convolution operations are still time-consuming. A common approach to implementing convolutions is to transform them into matrix multiplications, known as GEMM-based convolutions. This approach may incur additional memory overhead and relies on matrix multiplication routines that are not optimized for the matrices generated by convolutions. In this paper, we present a new parallel one-by-one direct convolution implementation for ARMv8 multi-core CPUs that requires no additional memory space. Our implementation is verified on two ARMv8 CPUs, Phytium FT-1500A and FT-2000plus. On Phytium FT-1500A, our implementation outperforms GEMM-based implementations in performance and scalability in all tests. On Phytium FT-2000plus, it delivers much better performance and scalability than GEMM-based approaches in most cases.
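To illustrate why a one-by-one convolution maps naturally onto a matrix multiplication and why a direct implementation needs no extra buffers (such as an im2col workspace), the sketch below gives a minimal scalar C loop nest. It is not the paper's implementation; the function name, NCHW-style layout, and parameter names are assumptions for illustration only.

```c
/* Minimal sketch (assumed layout, not the paper's code): for a 1x1 filter,
 * output[co][p] = sum over ci of filter[co][ci] * input[ci][p],
 * where p indexes the H*W spatial positions. This is exactly a GEMM of a
 * (C_out x C_in) filter matrix with a (C_in x H*W) input matrix, and a
 * direct loop nest computes it with no additional memory. */
#include <stddef.h>

void conv1x1_direct(const float *input,   /* C_in  x (H*W), row-major */
                    const float *filter,  /* C_out x C_in,  row-major */
                    float *output,        /* C_out x (H*W), row-major */
                    size_t c_in, size_t c_out, size_t hw)
{
    for (size_t co = 0; co < c_out; ++co) {
        float *out_row = &output[co * hw];
        for (size_t p = 0; p < hw; ++p)
            out_row[p] = 0.0f;                      /* initialize output row */
        for (size_t ci = 0; ci < c_in; ++ci) {
            const float w = filter[co * c_in + ci]; /* single 1x1 weight     */
            const float *in_row = &input[ci * hw];
            for (size_t p = 0; p < hw; ++p)
                out_row[p] += w * in_row[p];        /* accumulate channels   */
        }
    }
}
```

In an optimized version such as the one the paper targets, the spatial loop would be vectorized with ARMv8 NEON instructions and the channel loops blocked and parallelized across cores, but the memory footprint stays that of the inputs and outputs alone.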