Runtime Programmable and Memory Bandwidth Optimized FPGA-Based Coprocessor for Deep Convolutional Neural Network

2018 
The deep convolutional neural network (DCNN) is a class of machine learning algorithms based on feed-forward artificial neural networks and is widely used for image processing applications. Deploying DCNNs in real-world problems requires high computational power and high memory bandwidth in a power-constrained environment. A general-purpose CPU cannot exploit the different forms of parallelism offered by these algorithms and is therefore slow and energy inefficient for practical use. We propose a field-programmable gate array (FPGA)-based runtime-programmable coprocessor to accelerate the feed-forward computation of DCNNs. The coprocessor can be programmed for a new network architecture at runtime without resynthesizing the FPGA hardware; it therefore acts as a plug-and-use peripheral for the host computer. Caching of input features and filter weights is implemented using on-chip memory to reduce the external memory bandwidth requirement. Data are prefetched at several stages to avoid stalling the computational units, and several optimization techniques are used to reuse the fetched data efficiently. The dataflow is adjusted dynamically at runtime for each DCNN layer to achieve consistent computational throughput across a wide range of input feature sizes and filter sizes. The coprocessor is prototyped on a Xilinx Virtex-7 XC7VX485T FPGA (VC707 board) and operates at 150 MHz. Experimental results show that our implementation is $15{\times}$ more energy efficient than a highly optimized CPU implementation and achieves a consistent computational throughput of more than 140 G operations/s over a wide range of input feature sizes and filter sizes. Off-chip memory transactions decrease by $111{\times}$ due to the use of the on-chip cache.
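The abstract does not publish the coprocessor's actual loop structure or buffer dimensions; the following is only a minimal C sketch of the general idea behind on-chip caching of input features and filter weights, assuming hypothetical tile and filter sizes. Each input patch and each filter is fetched from (simulated) external memory once and then reused across an entire output tile, which is the kind of data reuse that reduces off-chip memory transactions.

/* Hypothetical tile/filter sizes chosen only for illustration. */
#include <stdio.h>

#define K    3                 /* filter size (assumption) */
#define IN   18                /* input feature map edge (assumption) */
#define OUT  (IN - K + 1)      /* output feature map edge = 16 */
#define TILE 8                 /* output tile cached on chip (assumption) */

static float input[IN][IN];    /* models off-chip input features */
static float weights[K][K];    /* models off-chip filter weights */
static float output[OUT][OUT]; /* models off-chip output features */

/* Compute one output tile using small local buffers that stand in for
 * on-chip memory: every fetched input pixel and weight is reused for
 * the whole tile instead of being re-read per output element. */
static void conv_tile(int ty, int tx) {
    float in_buf[TILE + K - 1][TILE + K - 1]; /* cached input patch */
    float w_buf[K][K];                        /* cached filter */

    for (int i = 0; i < TILE + K - 1; i++)    /* one bulk fetch of the patch */
        for (int j = 0; j < TILE + K - 1; j++)
            in_buf[i][j] = input[ty + i][tx + j];
    for (int i = 0; i < K; i++)               /* one bulk fetch of the filter */
        for (int j = 0; j < K; j++)
            w_buf[i][j] = weights[i][j];

    for (int oy = 0; oy < TILE; oy++)
        for (int ox = 0; ox < TILE; ox++) {
            float acc = 0.0f;
            for (int ky = 0; ky < K; ky++)
                for (int kx = 0; kx < K; kx++)
                    acc += in_buf[oy + ky][ox + kx] * w_buf[ky][kx];
            output[ty + oy][tx + ox] = acc;
        }
}

int main(void) {
    for (int i = 0; i < IN; i++)
        for (int j = 0; j < IN; j++)
            input[i][j] = 1.0f;
    for (int i = 0; i < K; i++)
        for (int j = 0; j < K; j++)
            weights[i][j] = 1.0f / (K * K);    /* simple averaging filter */

    for (int ty = 0; ty + TILE <= OUT; ty += TILE)
        for (int tx = 0; tx + TILE <= OUT; tx += TILE)
            conv_tile(ty, tx);

    printf("output[0][0] = %f\n", output[0][0]); /* expect 1.0 */
    return 0;
}

In a real FPGA implementation the buffer fetches would be burst transfers overlapped with computation (prefetching), and the tile dimensions would be adjusted per layer to keep the arithmetic units busy across different feature and filter sizes; this sketch shows only the reuse pattern, not the paper's specific dataflow.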