A Novel Convolutional Architecture For Video-Text Retrieval
2020
Prevalent video-text retrieval methods usually use recurrent neural networks to encode the sequence of frames in a video and the sequence of words in a text. In this paper, we introduce an encoding architecture based entirely on convolutional neural networks. Compared to recurrent models, its computational complexity is lower, and the computations over all sequence elements can be fully parallelized during training to better exploit the GPU. We stack convolution kernels of different scales to encode both local and long-term features of video and text. Experiments show that our method achieves a new state of the art for video-text retrieval on the MSR-VTT and MSVD datasets while requiring less training time.
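To make the idea of stacking convolution kernels of different scales concrete, the following is a minimal NumPy sketch, not the architecture from the paper. The kernel widths (1, 3, 5), the ReLU nonlinearity, channel concatenation, max pooling over time, and the names `conv1d_same`, `make_encoder`, `encode`, and `cosine` are all illustrative assumptions. Each layer processes every position of the sequence at once, with no recurrence, and deeper layers see a wider context.

```python
import numpy as np


def conv1d_same(x, kernel):
    """1D convolution over time with zero 'same' padding.

    x: (T, D_in) sequence of frame or word features.
    kernel: (k, D_in, D_out) with odd width k.
    Returns an array of shape (T, D_out).
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    # Gather the k shifted views so all positions are computed together.
    windows = np.stack([xp[i:i + x.shape[0]] for i in range(k)], axis=1)
    return np.einsum("tkd,kde->te", windows, kernel)


def make_encoder(d_in, d_out, widths, n_layers, rng):
    """Random weights for illustration only (assumed layout, untrained)."""
    layers = []
    for _ in range(n_layers):
        layers.append([rng.standard_normal((k, d_in, d_out)) * 0.1 for k in widths])
        d_in = d_out * len(widths)  # next layer consumes the concatenated channels
    return layers


def encode(x, layers):
    """Apply stacked multi-scale convolutions, then max-pool over time."""
    h = x
    for kernels in layers:
        # Kernels of different widths run in parallel on the same input.
        h = np.concatenate(
            [np.maximum(conv1d_same(h, w), 0.0) for w in kernels], axis=1
        )
    return h.max(axis=0)  # fixed-size vector regardless of sequence length


def cosine(a, b):
    """Cosine similarity used here as an assumed retrieval score."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


rng = np.random.default_rng(0)
video_encoder = make_encoder(d_in=8, d_out=4, widths=(1, 3, 5), n_layers=2, rng=rng)
text_encoder = make_encoder(d_in=6, d_out=4, widths=(1, 3, 5), n_layers=2, rng=rng)

video = rng.standard_normal((20, 8))  # 20 frames, 8-dim features (made-up sizes)
text = rng.standard_normal((7, 6))    # 7 words, 6-dim embeddings (made-up sizes)
score = cosine(encode(video, video_encoder), encode(text, text_encoder))
```

In this sketch the two encoders map sequences of different lengths and feature sizes into vectors of the same dimension, so a video and a sentence can be compared with a single similarity score. With two layers of width-5 kernels, each output position depends on up to nine input positions, which is how stacking extends the temporal range.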