A Novel Convolutional Architecture For Video-Text Retrieval

Zheng Li, Caili Guo, Bo Yang, Zerun Feng, Hao Zhang

2020

Prevalent video-text retrieval methods typically use recurrent neural networks to encode the frame sequences of videos and the word sequences of text. In this paper, we introduce an encoding architecture based entirely on convolutional neural networks. Compared to recurrent models, its complexity is lower, and computations over all sequence elements can be fully parallelized during training to better exploit GPUs. We stack convolution kernels of different scales to encode both local and long-term features of video and text. Experiments show that our method achieves a new state of the art for video-text retrieval on the MSR-VTT and MSVD datasets with less training time.
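The idea of stacking convolutions of different scales can be illustrated with a minimal pure-Python sketch. This is not the authors' architecture: the function names (`conv1d_same`, `encode`, `receptive_field`), the kernel sizes (3, 5, 7), the fixed averaging weights standing in for learned ones, and the max-pooling over time are all assumptions made for illustration. It shows only that each output position depends on a fixed window of inputs (so all positions could be computed in parallel) and that stacking layers widens that window from local to longer-range context.

```python
def conv1d_same(seq, kernel):
    """1-D convolution over time with zero padding (output length == input length).

    seq:    list of T feature vectors, each a list of D floats
    kernel: list of k weights (odd k), shared across feature dimensions
    """
    pad = len(kernel) // 2
    T, D = len(seq), len(seq[0])
    out = []
    for t in range(T):  # each t is independent of the others, unlike an RNN step
        vec = [0.0] * D
        for j, w in enumerate(kernel):
            s = t + j - pad
            if 0 <= s < T:  # positions outside the sequence count as zeros
                for d in range(D):
                    vec[d] += w * seq[s][d]
        out.append(vec)
    return out


def relu(seq):
    """Element-wise ReLU non-linearity."""
    return [[max(0.0, x) for x in vec] for vec in seq]


def encode(seq, kernel_sizes=(3, 5, 7)):
    """Stack conv layers with growing kernel size, then max-pool over time.

    The averaging kernels are placeholders for learned weights (assumption).
    """
    h = seq
    for k in kernel_sizes:
        h = relu(conv1d_same(h, [1.0 / k] * k))
    D = len(h[0])
    return [max(vec[d] for vec in h) for d in range(D)]


def receptive_field(kernel_sizes):
    """Number of input steps seen by one output step of a stride-1 conv stack."""
    return 1 + sum(k - 1 for k in kernel_sizes)


# Toy "video": 6 frames with 2-dimensional features (hypothetical data).
frames = [[float(t), 1.0] for t in range(6)]
embedding = encode(frames)
```

With these assumed kernel sizes, a single output position of the last layer covers 13 input steps, while the first layer alone covers only 3. The same `encode` function could be applied to a sequence of word vectors to produce a text embedding of the same dimensionality.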

Keywords:

Text retrieval
Architecture
Computer science
Speech recognition
Computer vision
Artificial intelligence

