Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA
2021
Recent vision-language understanding approaches adopt a multi-modal transformer pre-training and fine-tuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures cross-modal alignment only through indirect signals. In this work, we propose to enhance the alignment mechanism by incorporating image scene graph structures as a bridge between the two modalities, and by learning with new contrastive objectives. In a preliminary study on the challenging compositional visual question answering task, we show that the proposed approach achieves improved results, demonstrating its potential to enhance vision-language understanding.
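The abstract does not specify the exact form of the contrastive objectives, so the following is only a minimal sketch of one plausible instantiation: an in-batch InfoNCE-style loss that pulls matched scene-graph and text representations together. All names here (`sg_embed`, `txt_embed`, `temperature`) are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of a contrastive alignment objective between pooled
# scene-graph embeddings and pooled text embeddings; NOT the authors' code.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(sg_embed: torch.Tensor,
                               txt_embed: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (scene graph, text) pairs.

    sg_embed:  (B, D) pooled scene-graph representations, one per image.
    txt_embed: (B, D) pooled text representations, one per question/caption.
    Pairs at the same batch index are positives; all other in-batch
    combinations serve as negatives.
    """
    sg = F.normalize(sg_embed, dim=-1)
    txt = F.normalize(txt_embed, dim=-1)
    logits = sg @ txt.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(sg.size(0), device=sg.device)
    # Symmetric loss: scene-graph-to-text plus text-to-scene-graph.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

In this sketch the scene graph acts as the bridge modality described in the abstract: its pooled embedding, rather than raw region features, is what gets contrastively aligned with the text representation.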