KSF-ST: Video Captioning Based on Key Semantic Frames Extraction and Spatio-Temporal Attention Mechanism

2020 
Video captioning is a research hotspot in computer vision. Current video captioning algorithms suffer from two main problems. First, traditional algorithms extract video features by sampling frames at equal intervals, which discards key frames that carry a large amount of semantic information and thus degrades captioning accuracy; equal-interval sampling also produces many redundant frames, greatly increasing the computational cost. Second, traditional algorithms consider only temporal information when extracting features, yet the spatial features of images and videos also contain rich latent semantic information, so extracting temporal features alone leads to inaccurate natural language descriptions. To address these problems, we propose a video captioning method based on key semantic frame extraction and a spatio-temporal attention mechanism (KSF-ST). To extract key semantic frames, a knowledge graph is adopted to obtain the key semantic information of video frames, and knowledge reasoning is used to capture the correlations among entities in the knowledge graph. To extract the latent spatial semantic information of video frames, a spatial attention mechanism is combined with temporal features to generate accurate natural language descriptions. We evaluate KSF-ST on two benchmark datasets; extensive experiments demonstrate that our algorithm achieves better video captioning performance than state-of-the-art algorithms.
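The abstract describes combining spatial attention over regions within each frame with temporal attention over frames. The paper does not give its formulation here, so the following is only a minimal NumPy sketch of one common way such a spatio-temporal attention could be composed; the function name, the simple dot-product scoring vectors `w_s` and `w_t`, and the two-stage (spatial-then-temporal) ordering are all illustrative assumptions, not the authors' actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatio_temporal_attention(features, w_s, w_t):
    """Illustrative two-stage attention (not the paper's exact model).

    features: (T, R, D) array — T frames, R spatial regions per frame,
              D-dimensional feature per region.
    w_s, w_t: (D,) scoring vectors for spatial / temporal attention
              (hypothetical; real models usually condition on the
              decoder hidden state instead).
    Returns a single (D,) context vector for caption generation.
    """
    # Spatial attention: weight regions within each frame,
    # yielding one attended feature per frame.
    s_scores = features @ w_s                                  # (T, R)
    s_alpha = softmax(s_scores, axis=1)                        # (T, R)
    frame_feats = (s_alpha[..., None] * features).sum(axis=1)  # (T, D)

    # Temporal attention: weight the attended frame features over time.
    t_scores = frame_feats @ w_t                               # (T,)
    t_alpha = softmax(t_scores, axis=0)                        # (T,)
    context = (t_alpha[:, None] * frame_feats).sum(axis=0)     # (D,)
    return context
```

Because each stage applies a softmax, the result is a convex combination of the region features, so the context vector stays inside the range of the input features along every dimension.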