Boundary Detector Encoder and Decoder with Soft Attention for Video Captioning

2019 
The use of Recurrent Neural Networks and Convolutional Neural Networks for video captioning has received widespread attention, since the deep learning has developed rapidly. Based on classical encoder-decoder approach, we modify the encoding networks and decoding networks to improve the performance of the entire networks. In this paper, we introduce an encoding scheme that can detect the hierarchical structure of the input video. What’s more, we use soft attention mechanism which can learn to automatically select the relevant input frames from the input video to generate the description of the input video. Extensive experiments are conducted on two datasets: the Microsoft Video Description Corpus and the MSR-Video To Text. Three metrics, BLEU@4, METEOR and CIDEr are used to evaluate our approach. Experimental results demonstrate the effectiveness of our approach.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    26
    References
    2
    Citations
    NaN
    KQI
    []