Chinese description of videos incorporating multimodal features and attention mechanism

2021 
Video description is an active topic in computer vision and natural language processing, and it has seen remarkable progress in recent years. However, most research on video description generates English descriptions, and little addresses Chinese. This paper explores the process of generating Chinese video descriptions and proposes a model that introduces three complementary modal features and a temporal attention mechanism into the general encoder-decoder framework. Combined with an appropriate Chinese preprocessing method, the optimized video description model further improves the richness and accuracy of the Chinese descriptions. This work provides a valuable reference for future research on multilingual video description. We tested the proposed Chinese model on an expanded Chinese corpus of the standard English dataset MSVD (Microsoft Research video description corpus) and studied the special processing methods needed for Chinese description generation. Experimental results show that the highest METEOR score obtained by the proposed Chinese model is 6.6% higher than the best result on MSVD's Chinese corpus, and that the model also achieves strong results in the English setting.
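As a rough illustration of the temporal attention idea mentioned above, the following minimal sketch weights per-frame feature vectors by their relevance to the current decoder state and sums them into a context vector. The dot-product scoring, the function names, and the toy inputs are assumptions made for illustration; the abstract does not specify the scoring function or how the three modal features are fused, so this should not be read as the paper's implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frame_features, decoder_state):
    """Return (weights, context) for one decoding step.

    frame_features: list of per-frame feature vectors (lists of floats).
    decoder_state:  current decoder hidden state, same length as each frame vector.
    Scoring is a plain dot product (an assumption, not the paper's choice).
    """
    # Relevance score of each frame to the current decoder state.
    scores = [
        sum(f * h for f, h in zip(frame, decoder_state))
        for frame in frame_features
    ]
    # Normalize scores into attention weights over time steps.
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the frame features.
    dim = len(decoder_state)
    context = [
        sum(w * frame[d] for w, frame in zip(weights, frame_features))
        for d in range(dim)
    ]
    return weights, context


# Toy example: three frames with 2-dimensional features (hypothetical values).
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
state = [2.0, 0.0]
weights, context = temporal_attention(frames, state)
```

In this toy input the first and third frames align equally with the decoder state, so they receive equal weight and the second frame receives less. In an encoder-decoder captioner, a context vector of this kind would be recomputed at each decoding step and fed to the decoder together with the previously generated token.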