An 879GOPS 243mW 80fps VGA Fully Visual CNN-SLAM Processor for Wide-Range Autonomous Exploration

2019 
Simultaneous localization and mapping (SLAM) estimates an agent's trajectory in all six degrees of freedom (6 DoF) and constructs a 3D map of an unknown environment. It is a fundamental kernel that enables head-mounted augmented/virtual reality devices and autonomous navigation of micro aerial vehicles. A noticeable recent trend in visual SLAM is to apply computation- and memory-intensive convolutional neural networks (CNNs) that outperform traditional hand-designed feature-based methods [1]. For each video frame, CNN-extracted features are matched with stored keypoints to estimate the agent's 6-DoF pose by solving a perspective-n-point (PnP) non-linear optimization problem (Fig. 7.3.1, left). The agent's long-term trajectory over multiple frames is refined by a bundle adjustment process (BA, Fig. 7.3.1, right), which involves a large-scale ($\sim$120 variables) non-linear optimization. Visual SLAM requires massive computation ($>$250 GOP/s) for CNN-based feature extraction and matching, as well as data-dependent dynamic memory access and control flow with high-precision operations, creating significant low-power design challenges. Software implementations are impractical: a $\sim$3 GHz CPU + GPU system takes 0.2 s per frame with a $>$100 MB memory footprint and $>$100 W power consumption. Prior ASICs have either implemented an incomplete SLAM system [2, 3] that lacks ego-motion estimation or employed simplified (non-CNN) feature extraction and tracking [2, 4, 5] that limits SLAM quality and range. A recent ASIC [5] augments visual SLAM with an off-chip high-precision inertial measurement unit (IMU), reducing computational complexity but incurring additional power and cost overhead.
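The per-frame front end described above (CNN feature extraction, matching against stored keypoints, and PnP pose estimation) can be sketched in software as below. This is a minimal Python/OpenCV prototype, not the processor's on-chip implementation: the function name, parameters, brute-force matcher, and the use of cv2.solvePnPRansac are illustrative assumptions.

```python
import numpy as np
import cv2


def estimate_pose(frame_desc, frame_pts_2d, map_desc, map_pts_3d, K):
    """Hypothetical per-frame front end: match CNN descriptors of the current
    frame against stored map keypoints, then recover the camera's 6-DoF pose
    by solving the PnP problem with RANSAC."""
    # Brute-force matching of float descriptors; a CNN-SLAM front end would use
    # learned descriptors, but the matching/PnP structure is the same.
    matcher = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
    matches = matcher.match(frame_desc, map_desc)

    pts_2d = np.float32([frame_pts_2d[m.queryIdx] for m in matches])  # image observations
    pts_3d = np.float32([map_pts_3d[m.trainIdx] for m in matches])    # stored 3D landmarks

    # Robust PnP: estimates rotation (rvec) and translation (tvec) from 2D-3D matches.
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(pts_3d, pts_2d, K, None)
    return (rvec, tvec) if ok else None
```

Bundle adjustment then refines the poses and 3D points jointly over a window of frames by minimizing the total reprojection error, which is the large-scale non-linear optimization referred to above.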