KSF-ST: Video Captioning Based on Key Semantic Frames Extraction and Spatio-Temporal Attention Mechanism

2020 
Video captioning is a research hotspot in computer vision. Current video captioning algorithms suffer from two main problems. First, traditional algorithms extract video features by sampling frames at equal intervals, which discards key frames that carry a large amount of semantic information and thus degrades captioning accuracy; equal-interval sampling also produces many redundant frames, greatly increasing the computational cost. Second, traditional algorithms consider only temporal information when extracting features, yet the spatial features of images and videos also contain rich latent semantic information, so extracting temporal features alone leads to inaccurate natural language descriptions. To address these problems, we propose a video captioning method based on key semantic frame extraction and a spatio-temporal attention mechanism (KSF-ST). To extract key semantic frames, a knowledge graph is adopted to obtain the key semantic information of video frames, and knowledge reasoning is used to capture the correlations among entities in the knowledge graph. To extract the latent spatial semantic information of video frames, a spatial attention mechanism is combined with temporal features to generate accurate natural language descriptions. We evaluate KSF-ST on two benchmark datasets; extensive experiments demonstrate that our algorithm achieves better video captioning performance than state-of-the-art algorithms.
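The abstract describes combining spatial attention over regions within each frame with temporal attention over frames. The paper does not give its formulation here, so the following is only a minimal NumPy sketch of one common way such a spatio-temporal attention could be composed; the function name, the simple dot-product scoring vectors `w_s` and `w_t`, and the two-stage (spatial-then-temporal) ordering are all illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(features, w_s, w_t):
    """Illustrative two-stage attention (not the paper's exact model).

    features: (T, R, D) array — T frames, R spatial regions per frame,
              D-dimensional feature per region.
    w_s, w_t: (D,) scoring vectors for spatial / temporal attention
              (hypothetical; real models usually condition on the
              decoder hidden state instead).
    Returns a single (D,) context vector for caption generation.
    """
    # Spatial attention: weight regions within each frame,
    # yielding one attended feature per frame.
    s_scores = features @ w_s                                  # (T, R)
    s_alpha = softmax(s_scores, axis=1)                        # (T, R)
    frame_feats = (s_alpha[..., None] * features).sum(axis=1)  # (T, D)

    # Temporal attention: weight the attended frame features over time.
    t_scores = frame_feats @ w_t                               # (T,)
    t_alpha = softmax(t_scores, axis=0)                        # (T,)
    context = (t_alpha[:, None] * frame_feats).sum(axis=0)     # (D,)
    return context
```

Because each stage applies a softmax, the result is a convex combination of the region features, so the context vector stays inside the range of the input features along every dimension.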