Large-scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline
2019
Prior work in visual dialog has focused on training deep neural models on VisDial in isolation. Instead, we present an approach to leverage pretraining on related vision-language datasets before transferring to visual dialog. We adapt the recently proposed ViLBERT model for multi-turn visually-grounded conversations. Our model is pretrained on the Conceptual Captions and Visual Question Answering datasets, and finetuned on VisDial. Our best single model outperforms prior published work by \(1\%\) absolute on NDCG and MRR.
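The recipe the abstract describes is a standard two-stage transfer pipeline: first pretrain a shared vision-language encoder on large generic corpora (Conceptual Captions, VQA), then finetune the same weights on the smaller downstream VisDial data. A minimal, purely illustrative sketch of that weight-reuse pattern is below; the names (`TinyEncoder`, `pretrain`, `finetune`) and the toy scalar model are assumptions for illustration, not the authors' actual ViLBERT code.

```python
import random

class TinyEncoder:
    """Stand-in for a ViLBERT-style encoder: a single scalar weight."""
    def __init__(self):
        self.w = 0.0

    def predict(self, x):
        return self.w * x

    def fit(self, data, lr=0.01, epochs=100):
        # Plain squared-error gradient descent on (x, y) pairs.
        for _ in range(epochs):
            for x, y in data:
                grad = 2 * (self.predict(x) - y) * x
                self.w -= lr * grad
        return self

def pretrain(encoder, caption_data):
    # Stage 1: large generic vision-language corpus (e.g. captions, VQA).
    return encoder.fit(caption_data)

def finetune(encoder, dialog_data):
    # Stage 2: small task-specific corpus; starts from pretrained weights
    # instead of random initialization.
    return encoder.fit(dialog_data, lr=0.005, epochs=50)

random.seed(0)
# Toy data: both tasks share the underlying mapping y = 3x,
# so pretraining transfers usefully to the downstream task.
captions = [(x, 3 * x) for x in (random.uniform(-1, 1) for _ in range(50))]
dialogs = [(x, 3 * x) for x in (random.uniform(-1, 1) for _ in range(5))]

enc = pretrain(TinyEncoder(), captions)
enc = finetune(enc, dialogs)        # reuses, then adapts, pretrained weights
```

The key design point mirrored here is that `finetune` receives the already-trained encoder rather than a fresh one, which is what distinguishes the transfer approach from training on VisDial in isolation.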
References: 61 · Citations: 43