VDARN: Video Disentangling Attentive Relation Network for Few-Shot and Zero-Shot Action Recognition

2020 
Abstract
In most supervised action recognition methods, sufficient labeled training instances are needed for each class; the learned model can only recognize samples belonging to the classes covered by the training data and cannot handle previously unseen classes. In this paper, we tackle these challenges by proposing a novel Video Disentangling Attentive Relation Network (VDARN), which can be trained in an end-to-end manner with a standard deep neural network framework. Unlike most Few-Shot Action Recognition (FSAR) and Zero-Shot Action Recognition (ZSAR) methods, the proposed method accounts for the natural temporal misalignment between object cues and human motion cues in the visual space. Specifically, we propose a video disentangling module that decomposes segment-wise video features into motion cues and object cues. We further design a dual-branch attention module that aligns the segment-wise motion and object cues in the temporal domain by learning the temporal autocorrelation of the two cues (in the FSAR case); in the ZSAR case, the dual-branch attention module estimates the similarities between the semantic embedding and the segment-wise motion/object cues of the video. Finally, a relation module learns and measures the distance or similarity between video pairs (FSAR) or between a video and an unseen label (ZSAR). Extensive experiments on three realistic action benchmarks, Olympic Sports, HMDB51, and UCF101, demonstrate the favorable performance of the proposed framework in both few-shot and zero-shot action recognition.
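To make the three-stage pipeline described above concrete, the following is a minimal PyTorch sketch of the FSAR flow: disentangling segment features into motion and object cues, attending over them in time, and scoring a query-support pair with a relation module. All module names, layer choices (linear heads, multi-head self-attention, an MLP relation head), and dimensions are assumptions for illustration only; the paper's actual architecture is not specified in this abstract.

```python
# Hypothetical sketch of the VDARN pipeline (FSAR case); not the authors' code.
import torch
import torch.nn as nn


class VideoDisentangler(nn.Module):
    """Decomposes segment-wise video features into motion and object cues."""
    def __init__(self, feat_dim=2048, cue_dim=512):
        super().__init__()
        self.motion_head = nn.Linear(feat_dim, cue_dim)  # assumed projection
        self.object_head = nn.Linear(feat_dim, cue_dim)  # assumed projection

    def forward(self, segments):                  # segments: (B, T, feat_dim)
        return self.motion_head(segments), self.object_head(segments)


class DualBranchAttention(nn.Module):
    """Temporally attends over motion and object cues in two branches."""
    def __init__(self, cue_dim=512, heads=4):
        super().__init__()
        self.motion_attn = nn.MultiheadAttention(cue_dim, heads, batch_first=True)
        self.object_attn = nn.MultiheadAttention(cue_dim, heads, batch_first=True)

    def forward(self, motion, obj):               # each: (B, T, cue_dim)
        m, _ = self.motion_attn(motion, motion, motion)  # temporal self-attention
        o, _ = self.object_attn(obj, obj, obj)
        return m.mean(dim=1), o.mean(dim=1)       # pool over segments


class RelationModule(nn.Module):
    """Learns a similarity score between query and support representations."""
    def __init__(self, cue_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(4 * cue_dim, cue_dim), nn.ReLU(), nn.Linear(cue_dim, 1))

    def forward(self, query, support):            # each: (B, 2 * cue_dim)
        return torch.sigmoid(self.score(torch.cat([query, support], dim=-1)))


if __name__ == "__main__":
    disentangle, attend, relate = VideoDisentangler(), DualBranchAttention(), RelationModule()
    q = torch.randn(2, 8, 2048)                   # 2 query videos, 8 segments each
    s = torch.randn(2, 8, 2048)                   # matching support videos
    qm, qo = attend(*disentangle(q))
    sm, so = attend(*disentangle(s))
    scores = relate(torch.cat([qm, qo], -1), torch.cat([sm, so], -1))
    print(scores.shape)                           # (2, 1) similarity scores
```

In the ZSAR case one branch of the attention module would instead compare segment-wise cues against a class semantic embedding rather than performing temporal self-alignment, but the abstract does not give enough detail to sketch that variant faithfully.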