Dynamic Spatio-Temporal Feature Learning via Graph Convolution in 3D Convolutional Networks

2019 
Video data exhibits strong dynamics in both the spatial and temporal domains. In the literature, 3D Convolutional Neural Networks (3D CNNs) are a successful technique for learning spatio-temporal features simultaneously. However, because of their expensive computation, the convolution kernels used in 3D CNNs are usually rather small, which largely limits their learning capability. To address this issue, in this paper we attempt to enhance the ability of 3D CNNs to extract dynamic features. We capture long-distance information by modeling the temporal and spatial features as graphs, and then learn the dynamic graph structure from the feature maps of the 3D CNN. This corresponds to a powerful Graph Convolutional Network (GCN) whose adjacency matrix is determined dynamically from the feature maps. With the learnt dynamic graph, we introduce and fuse a frame-wise GCN and a channel-wise GCN to enhance the temporal and spatial feature learning of 3D CNNs. The proposed spatio-temporal graph convolutional network (STGCN) works as a general module that can be embedded into popular 3D CNN architectures (e.g., ResNeXt, P3D). Extensive experiments on two video datasets for action recognition (UCF-101 and HMDB-51) demonstrate that state-of-the-art models equipped with our STGCN module achieve significant performance improvements.
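The core idea above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; the function name `dynamic_gcn`, the softmax-of-inner-products adjacency, and the toy dimensions are all illustrative assumptions. It shows one graph-convolution layer whose adjacency matrix is computed dynamically from the input features, applied once frame-wise (frames as graph nodes) and once channel-wise (channels as graph nodes):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_gcn(X, W):
    """One GCN layer with a dynamic adjacency matrix.

    The adjacency A = softmax(X X^T) is recomputed from the input
    features X (one row per graph node), so the graph structure
    adapts to each feature map; the layer output is ReLU(A X W).
    (Illustrative formulation, not the paper's exact construction.)
    """
    A = softmax(X @ X.T, axis=-1)      # (N, N) dynamic adjacency
    return np.maximum(A @ X @ W, 0.0)  # aggregate, project, ReLU

rng = np.random.default_rng(0)
T, C, H = 8, 16, 4                     # frames, channels, hidden size

# Stand-in for a (spatially pooled) 3D-CNN feature map: T frames x C channels.
F = rng.standard_normal((T, C))

# Frame-wise branch: each of the T frames is a node -> temporal relations.
out_frame = dynamic_gcn(F, rng.standard_normal((C, H)))    # shape (T, H)

# Channel-wise branch: each of the C channels is a node -> channel relations.
out_chan = dynamic_gcn(F.T, rng.standard_normal((T, H)))   # shape (C, H)
```

In the paper the two branches are fused and the module is inserted into a 3D CNN backbone; here they are simply computed side by side to show the frame-wise/channel-wise distinction.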