Hierarchical Representation Network With Auxiliary Tasks For Video Captioning

Yu Lei, Zhonghai He, Pengpeng Zeng, Jingkuan Song, Lianli Gao

2021

Video captioning aims to understand a video in depth and generate a high-quality description of it. However, due to the complexity of videos, it is challenging to extract a video feature that represents multiple levels of concepts well, i.e., events, objects, and actions. Meanwhile, content completeness and syntactic consistency play an important role in high-quality video captioning. Motivated by these observations, we propose a novel framework, named Hierarchical Representation Network with Auxiliary Tasks (HRNAT), for learning multi-level representations and generating syntax-aware video captions. Specifically, the Cross-modality Matching Task enables the learning of a hierarchical representation of videos, guided by the three-level representation of language. The Syntax-guiding Task and Vision-assist Task contribute to generating descriptions that are not only globally similar to the video but also syntactically consistent with the ground-truth description. Finally, results on several benchmark datasets validate the effectiveness and superiority of our method compared with state-of-the-art methods.
