DVAMN: Dual Visual Attention Matching Network for Zero-Shot Action Recognition

2021 
Zero-Shot Action Recognition (ZSAR) aims to transfer knowledge from a source domain to a target domain so that unlabelled actions can be inferred and recognized. However, previous methods often fail to highlight the salient factors of a video sequence, and in cross-modal retrieval, redundant information weakens the association of key information across modalities. In this paper, we propose the Dual Visual Attention Matching Network (DVAMN) to distill sparse saliency information from action videos. We utilize a dual visual attention mechanism and spatiotemporal Gated Recurrent Units (GRUs) to build a non-redundant, sparse visual space, which boosts cross-modal recognition performance. A relational learning strategy is employed for the final classification, and the whole network is trained end-to-end. Experiments on the HMDB51 and UCF101 datasets show that the proposed architecture achieves state-of-the-art results among zero-shot action recognition methods that use only spatial and temporal video features.
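The core pipeline the abstract describes, attention-weighted pooling of frame features into a compact video embedding followed by similarity-based matching against class embeddings, can be illustrated with a toy sketch. This is not the paper's implementation: the scoring vector `w`, the cosine-similarity stand-in for the relation module, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(frames, w):
    """Pool per-frame features into one video embedding.

    frames: (T, D) per-frame feature vectors
    w:      (D,)   learnable scoring vector (random here for illustration)
    """
    scores = frames @ w          # (T,) one salience score per frame
    alpha = softmax(scores)      # attention weights sum to 1 over time
    return alpha @ frames        # (D,) weighted average of frame features

def relation_score(video_emb, class_emb):
    """Cosine similarity as a stand-in for a learned relation module."""
    denom = np.linalg.norm(video_emb) * np.linalg.norm(class_emb) + 1e-8
    return float(video_emb @ class_emb / denom)

# Toy data: 16 frames of 32-d features, 5 hypothetical unseen-class embeddings.
T, D, C = 16, 32, 5
frames = rng.normal(size=(T, D))
w = rng.normal(size=D)
class_embs = rng.normal(size=(C, D))

video_emb = temporal_attention(frames, w)
scores = [relation_score(video_emb, c) for c in class_embs]
pred = int(np.argmax(scores))  # predicted unseen class index
```

In DVAMN the attention is dual (applied over both spatial and temporal dimensions) and the pooled features pass through spatiotemporal GRUs before the relation module, but the match-by-similarity structure at inference time is the same.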