Split and Attentive-Aggregated Learnable Shift Module for Video Action Recognition

2020 
Existing approaches to video action recognition using convolutional neural networks (CNNs) usually suffer from a trade-off between accuracy and complexity. On the one hand, 2D CNNs are computationally cheap but have difficulty modeling long-term temporal dependencies. On the other hand, 3D CNNs can perceive temporal cues but incur a high computational cost. In this paper, we propose a generic building block named the Split and Attentive-aggregated Learnable Shift Module (SALSM), which has the capacity to model spatiotemporal representations while maintaining the complexity of a 2D CNN. Specifically, we split the input tensor into multiple groups and conduct adaptive shift operations by applying learnable shift kernels to the channels of each group along the time dimension, so that spatiotemporal information from neighboring frames can be mingled by the subsequent 2D convolutions. The output feature maps of the groups are then integrated with an attention mechanism. With SALSM plugged in, a 2D CNN is enabled to handle temporal information and becomes a highly efficient spatiotemporal feature extractor with few additional parameters and little extra computational cost. We conduct ablation experiments to verify the effectiveness of our method, and the proposed SALSM achieves competitive or even better results than state-of-the-art methods on several benchmark datasets.
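
To make the mechanism described in the abstract concrete, below is a minimal PyTorch sketch of the idea, not the authors' implementation: channels are split into groups, each group is shifted along the time dimension by a learnable depthwise temporal convolution (a soft, learnable version of a channel shift), and the group outputs are fused with a soft attention over groups before being passed to the ordinary 2D convolutions. The group count, temporal kernel size, and the exact form of the attentive aggregation are assumptions made for illustration.

    # Sketch of a SALSM-style module, assuming TSM-style input where the T frames
    # of each clip are stacked along the batch axis, i.e. x has shape (N*T, C, H, W).
    # Group count, kernel size, and attention design are illustrative assumptions.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SALSMSketch(nn.Module):
        def __init__(self, channels, num_segments, groups=4, t_kernel=3):
            super().__init__()
            assert channels % groups == 0
            self.t = num_segments
            self.groups = groups
            group_c = channels // groups
            # One depthwise temporal conv per group: its kernel acts as a
            # learnable (soft) shift of each channel along the time dimension.
            self.shifts = nn.ModuleList([
                nn.Conv1d(group_c, group_c, kernel_size=t_kernel,
                          padding=t_kernel // 2, groups=group_c, bias=False)
                for _ in range(groups)
            ])
            # Lightweight attention producing one fusion weight per group.
            self.attn = nn.Sequential(
                nn.Linear(channels, channels // 4),
                nn.ReLU(inplace=True),
                nn.Linear(channels // 4, groups),
            )

        def forward(self, x):
            nt, c, h, w = x.shape
            n = nt // self.t
            # Rearrange to (N, C, T, H, W) so time is an explicit axis.
            x5 = x.view(n, self.t, c, h, w).permute(0, 2, 1, 3, 4)

            # Split channels into groups and apply the learnable temporal shift.
            chunks = torch.chunk(x5, self.groups, dim=1)      # each (N, C/g, T, H, W)
            shifted = []
            for conv, g in zip(self.shifts, chunks):
                b, gc, t, hh, ww = g.shape
                flat = g.permute(0, 3, 4, 1, 2).reshape(-1, gc, t)   # (N*H*W, C/g, T)
                out = conv(flat).view(b, hh, ww, gc, t).permute(0, 3, 4, 1, 2)
                shifted.append(out)

            # Attentive aggregation: soft weight per group from pooled features.
            pooled = x5.mean(dim=(2, 3, 4))                   # (N, C)
            w_attn = F.softmax(self.attn(pooled), dim=1)      # (N, groups)
            fused = [s * w_attn[:, i].view(-1, 1, 1, 1, 1)
                     for i, s in enumerate(shifted)]

            out = torch.cat(fused, dim=1)                     # (N, C, T, H, W)
            return out.permute(0, 2, 1, 3, 4).reshape(nt, c, h, w)

In a TSM-style setup, a module of this kind would be inserted before the 2D convolution of each residual block, so the temporally mixed features are processed by the unchanged 2D backbone; the depthwise 1D kernels add only a small number of parameters per block.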