Semantic Tag Augmented XlanV Model for Video Captioning

2021 
The key to video captioning is leveraging cross-modal information from both the vision and language perspectives. Rather than directly concatenating or attending to the visual and linguistic features as in previous works, we propose to leverage semantic tags to bridge the gap between the two modalities. The semantic tags are the object tags and action tags detected in a video, which can be viewed as partial captions for the input video. To exploit the semantic tags effectively, we design a Semantic Tag augmented XlanV (ST-XlanV) model that encodes four kinds of visual and semantic features with X-Linear Attention based cross-attention modules. Moreover, tag-related tasks are designed in the pre-training stage to help the model exploit the cross-modal information more fruitfully. With the help of the semantic tags, the proposed model reaches 5th place in the pre-training for video captioning challenge. Our code will be available at: https://github.com/RubickH/ST-XlanV.
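The abstract does not give implementation details, but the X-Linear Attention operator that ST-XlanV builds its cross-attention modules on (Pan et al., CVPR 2020) can be sketched as below. It fuses the query and keys by an elementwise (second-order) product, then applies a spatial softmax over positions and a channel-wise sigmoid gate. The layer names, the 512-dimensional sizes, and the ReLU activations here are illustrative assumptions, not the authors' released code:

import torch
import torch.nn as nn


class XLinearAttention(nn.Module):
    """Minimal sketch of one X-Linear attention head: bilinear query-key
    fusion, spatial attention over positions, and a channel-wise gate.
    Dimensions and activations are assumptions for illustration."""

    def __init__(self, d_model=512):
        super().__init__()
        self.q_k = nn.Linear(d_model, d_model)      # embeds the query for key fusion
        self.k = nn.Linear(d_model, d_model)        # embeds the keys
        self.q_v = nn.Linear(d_model, d_model)      # embeds the query for value fusion
        self.v = nn.Linear(d_model, d_model)        # embeds the values
        self.emb = nn.Linear(d_model, d_model)      # joint embedding of the bilinear map
        self.spatial = nn.Linear(d_model, 1)        # per-position attention logit
        self.channel = nn.Linear(d_model, d_model)  # per-channel gate

    def forward(self, query, keys, values):
        # query: (B, d); keys, values: (B, N, d)
        q = query.unsqueeze(1)                                     # (B, 1, d)
        # Second-order interaction: elementwise product of embedded query and keys.
        b_k = torch.relu(self.k(keys)) * torch.relu(self.q_k(q))   # (B, N, d)
        b_v = torch.relu(self.v(values)) * torch.relu(self.q_v(q)) # (B, N, d)
        b_emb = torch.relu(self.emb(b_k))                          # (B, N, d)
        # Spatial attention over the N key positions.
        alpha = torch.softmax(self.spatial(b_emb), dim=1)          # (B, N, 1)
        # Channel attention from the mean-pooled joint embedding.
        beta = torch.sigmoid(self.channel(b_emb.mean(dim=1)))      # (B, d)
        # Attended value, gated channel-wise.
        return beta * (alpha * b_v).sum(dim=1)                     # (B, d)


if __name__ == "__main__":
    # Toy cross-attention: a caption-decoder state attends over video features.
    attn = XLinearAttention(d_model=512)
    state = torch.randn(2, 512)       # decoder hidden state (query)
    feats = torch.randn(2, 36, 512)   # e.g. frame/region features (keys = values)
    out = attn(state, feats, feats)
    print(out.shape)                  # torch.Size([2, 512])

In ST-XlanV, blocks of this kind would cross-attend between feature streams, for example letting visual features attend over the detected object and action tags; the repository at the URL above is the authoritative reference for the actual module layout.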