Hierarchical Representation Network With Auxiliary Tasks For Video Captioning

2021 
Video captioning aims to understand a video in depth and generate high-quality descriptions of it. However, because videos are complex, it is challenging to extract a video feature that represents concepts at multiple levels, i.e., events, objects, and actions. Meanwhile, content completeness and syntactic consistency play an important role in high-quality video captioning. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and generating syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of hierarchical video representations, guided by the three-level representation of language. The Syntax-guiding Task and Vision-assist Task help generate descriptions that are not only globally similar to the video but also syntactically consistent with the ground-truth description. Finally, results on several benchmark datasets validate the effectiveness and superiority of our method compared with state-of-the-art methods.