Hybrid-Attention Enhanced Two-Stream Fusion Network for Video Venue Prediction

2020 
Video venue category prediction has been drawing increasing attention in the multimedia community for applications such as personalized location recommendation, video verification and law enforcement. Most existing works resort to information from either multiple modalities or other platforms to strengthen video representations. However, noisy acoustic information, sparse textual descriptions and incompatible cross-platform data can limit the performance gain and reduce the generality of the model. Different from existing works, we focus on extracting discriminative visual features from videos by introducing a hybrid-attention network structure. In particular, we propose a novel Global-Local Attention Module (GLAM), which can be inserted into neural networks to generate enhanced visual features from video content. In GLAM, the Global Attention (GA) captures contextual scene-oriented information and its layout by assigning different weights to channels, while the Local Attention (LA) learns salient object-oriented features by allocating different weights to spatial regions. Moreover, GLAM can be extended to variants with multiple GAs and LAs for further visual enhancement. The two types of features captured by GAs and LAs are integrated via convolution layers and then fed into a convolutional Long Short-Term Memory (convLSTM) to generate discriminative spatial-temporal representations, constituting the content stream. In addition, video motion is exploited to learn long-term movement variations, since it also contributes to venue category prediction. The content and motion streams together constitute our proposed Hybrid-Attention Enhanced Two-Stream Fusion Network (HA-TSFN), which fuses the features from the two streams into complementary representations. Extensive experiments demonstrate that our method achieves state-of-the-art prediction performance on the large-scale Vine dataset. Visualizations also show that the proposed GLAM captures complementary scene-oriented and object-oriented visual features from videos.
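The abstract does not give the exact formulation of GLAM, but its description of GA (per-channel weighting for scene context) and LA (per-location weighting for salient objects) maps naturally onto standard channel and spatial attention blocks. The PyTorch sketch below illustrates one plausible reading of the module; the class names, the squeeze-and-excitation-style GA branch, the 1x1-convolution LA branch, and the convolutional fusion are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Channel-wise (scene-oriented) attention: pool out the spatial
    dimensions, then re-weight feature channels (an assumed GA form)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        w = self.pool(x).flatten(1)              # (B, C) channel descriptor
        w = self.fc(w).view(x.size(0), -1, 1, 1) # per-channel weights
        return x * w

class LocalAttention(nn.Module):
    """Spatial (object-oriented) attention: one weight per spatial
    location, broadcast over channels (an assumed LA form)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (B, C, H, W)
        return x * self.conv(x)                  # (B, 1, H, W) spatial map

class GLAM(nn.Module):
    """Hypothetical Global-Local Attention Module: GA and LA branches
    integrated by a convolution layer, as described in the abstract."""
    def __init__(self, channels):
        super().__init__()
        self.ga = GlobalAttention(channels)
        self.la = LocalAttention(channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        # Concatenate the two enhanced feature maps and fuse back to C channels,
        # so the module can be dropped into a backbone without changing shapes.
        return self.fuse(torch.cat([self.ga(x), self.la(x)], dim=1))

if __name__ == "__main__":
    feat = torch.randn(2, 256, 14, 14)           # a dummy frame-level feature map
    out = GLAM(256)(feat)
    print(out.shape)                             # torch.Size([2, 256, 14, 14])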