2.2 A 978GOPS/W Flexible Streaming Processor for Real-Time Image Processing Applications in 22nm FDSOI

2019 
The recent trend towards high-frame-rate, high-resolution image processing necessitates a solution for the growing Von Neumann bottleneck. Bandwidth pressure is relieved in traditional image processors by tiling the image and storing pixel patches in a local scratchpad memory [1]. Yet this only partially alleviates the memory bottleneck, because overlapping patches are fetched repeatedly and scratchpad access itself becomes the limiting factor (Fig. 2.2.1). Streaming architectures [2] eliminate this overhead by streaming image pixels into an array of processing elements and forwarding intermediate results to the next processing element instead of sending them back to memory. Crucial to the efficiency of such streaming image processing architectures is the line buffering strategy, required to merge intermediate results in the stream with previously computed results. Existing state-of-the-art architectures either have dedicated, application-specific line buffer instances that are inflexible towards varied image processing workloads [1], or achieve flexibility by routing streams back to centralized scratchpad memories, again creating a memory bottleneck around the centralized buffers and introducing the need for inefficient on-chip networks between the processing elements [3]. This paper introduces the concept of flexible, instruction-programmable stream processing with embedded configurable-delay FIFOs to enable line buffering for a wide range of computer vision algorithms on a single platform. Providing a deep reconfigurable pipeline, this design facilitates the mapping and execution of complex vision tasks, such as dense optical flow, in real time (30fps VGA) at low power (10.7mW), marking a $5.8\times$ improvement over the state-of-the-art.
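To illustrate the line-buffering concept the abstract refers to, the following minimal sketch (not from the paper) models how configurable-delay FIFOs sized to one image line let a streaming pipeline merge a pixel with the pixels directly above it, here for an assumed 3x3 box filter at VGA width. The names, the line width W, and the kernel are illustrative assumptions, not the chip's actual datapath.

```cpp
// Illustrative sketch only: line buffering for a 3x3 stencil on a pixel
// stream, using two FIFOs whose depth equals the (assumed) line width.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

constexpr std::size_t W = 640;  // assumed line width (VGA)

// A FIFO of depth W behaves as a one-line delay element in the stream.
struct LineFIFO {
    std::deque<uint8_t> q;
    uint8_t push(uint8_t in) {      // returns the pixel delayed by W cycles
        q.push_back(in);
        if (q.size() <= W) return 0;  // not yet full: emit zero padding
        uint8_t out = q.front();
        q.pop_front();
        return out;
    }
};

// Streaming 3x3 box filter: each incoming pixel is combined with the pixels
// one and two lines above it (recovered from the line FIFOs) via a small
// sliding window, so no pixel ever travels back to a scratchpad memory.
std::vector<uint8_t> box3x3(const std::vector<uint8_t>& stream) {
    LineFIFO l1, l2;
    uint8_t win[3][3] = {};         // 3x3 sliding window over the stream
    std::vector<uint8_t> out;
    for (uint8_t px : stream) {
        uint8_t r1 = l1.push(px);   // pixel one line above
        uint8_t r2 = l2.push(r1);   // pixel two lines above
        for (int r = 0; r < 3; ++r) {  // shift window one column to the left
            win[r][0] = win[r][1];
            win[r][1] = win[r][2];
        }
        win[0][2] = r2;
        win[1][2] = r1;
        win[2][2] = px;
        int sum = 0;
        for (int r = 0; r < 3; ++r)
            for (int c = 0; c < 3; ++c) sum += win[r][c];
        out.push_back(static_cast<uint8_t>(sum / 9));
    }
    return out;
}
```

In this reading, making the FIFO depth configurable rather than fixed is what lets a single instruction-programmable pipeline serve stencils of different widths and image resolutions, instead of hard-wiring one application-specific line buffer per kernel.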