Accelerating Depthwise Separable Convolutions with Vector Processor.

2021 
Depthwise separable convolution has demonstrated its advantages in reducing the number of parameters and computations in neural networks. Convolution-oriented hardware accelerators excel at saving resources and energy; however, lightweight networks designed for small processors run inefficiently on these accelerators, and there are too many models to design an application-specific circuit for each one. In this work, we propose a method for mapping depthwise separable convolution onto a general-purpose vector processor. The method achieves high computational performance by increasing data reuse and parallel execution. First, we propose a multi-vector parallel convolution method that reduces the number of data reads and increases data utilization in depthwise convolution. Then, we divide the data of pointwise convolution into coarse-grained blocks and compute the matrix multiplication in parallel on a multi-core processor, achieving high computational efficiency. Furthermore, we use a double-buffer mechanism to optimize data transfer and shorten execution time. Evaluated on the depthwise separable convolutions of MobileNet, the multi-vector parallel convolution method on M-DSP reduces the number of reads and writes by up to 4\( \times \). We achieve 1518 FPS and 1.783 TFLOPS at a batch size of 1, which is 1.87\( \times \) faster than a ZU9 MPSoC and 3.89\( \times \) more compute-efficient than a 2080Ti GPU.
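The abstract describes the two stages being accelerated: a depthwise stage (one 2D filter per input channel) and a pointwise stage (a 1×1 convolution, which reduces to a matrix multiply and therefore maps naturally onto parallel GEMM blocks). A minimal NumPy sketch of the computation, not the paper's M-DSP implementation, with hypothetical names (`depthwise_separable_conv`, `dw`, `pw`), might look like:

```python
import numpy as np

def depthwise_separable_conv(x, dw, pw):
    """x: (H, W, C_in); dw: (k, k, C_in) depthwise filters;
    pw: (C_in, C_out) pointwise weights. Stride 1, no padding."""
    H, W, C_in = x.shape
    k = dw.shape[0]
    Ho, Wo = H - k + 1, W - k + 1
    out_dw = np.zeros((Ho, Wo, C_in))
    # Depthwise stage: each channel is convolved with its own k x k filter.
    for i in range(Ho):
        for j in range(Wo):
            out_dw[i, j] = np.einsum('abc,abc->c', x[i:i+k, j:j+k], dw)
    # Pointwise stage: a 1x1 convolution is a plain matrix multiply over
    # channels, which is why it can be blocked and run as parallel GEMM.
    return (out_dw.reshape(-1, C_in) @ pw).reshape(Ho, Wo, -1)

k, C_in, C_out = 3, 8, 16
x = np.random.rand(10, 10, C_in)
dw = np.random.rand(k, k, C_in)
pw = np.random.rand(C_in, C_out)
y = depthwise_separable_conv(x, dw, pw)
print(y.shape)  # (8, 8, 16)

# Parameter savings vs. a standard k x k convolution:
standard = k * k * C_in * C_out          # 3*3*8*16 = 1152
separable = k * k * C_in + C_in * C_out  # 3*3*8 + 8*16 = 200
```

The parameter comparison at the end illustrates the reduction the abstract's first sentence refers to: factoring a standard convolution into depthwise plus pointwise stages replaces \( k^2 C_{in} C_{out} \) weights with \( k^2 C_{in} + C_{in} C_{out} \).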