Multi-Stage Hybrid Embedding Fusion Network for Visual Question Answering

2020 
Abstract

Multimodal fusion is a crucial component of Visual Question Answering (VQA), which requires joint understanding and semantic integration of visual and textual information. Existing VQA frameworks focus mainly on the Latent Embedding Fusion (LEF) method, which projects visual and textual features into a common latent space and fuses them with element-wise multiplication. In this paper, we aim to achieve multiple, fine-grained multimodal interactions to enhance fusion performance. To this end, we propose a Multi-stage Hybrid Embedding Fusion (MHEF) network that improves on LEF in two ways. First, we introduce a Dual Embedding Fusion (DEF) approach that transforms each modal input into the reciprocal embedding space before integration; DEF is then combined with LEF to form a novel Hybrid Embedding Fusion (HEF). Second, we design a Multi-stage Fusion Structure (MFS) that stacks HEF modules into the MHEF network, yielding diverse and stronger fusion features for answer prediction. By jointly training the multi-stage framework, we not only improve the performance of each individual stage but also obtain further accuracy gains by integrating the predictions from all stages. Extensive experiments verify that both the proposed HEF and MFS benefit multimodal fusion. The full MHEF model outperforms the baseline LEF model by 2% in accuracy and achieves promising performance on the VQA-v1 and VQA-v2 datasets.
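To make the fusion schemes concrete, the following is a minimal PyTorch sketch of the LEF and DEF branches as described in the abstract: LEF multiplies the two modalities element-wise in a shared latent space, while DEF first maps each modality into the other modality's embedding space before fusing. All dimensions, layer names, activations, and the additive way the branches are merged into HEF are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class HybridEmbeddingFusion(nn.Module):
    """Sketch of LEF + DEF -> HEF fusion (assumed design, not the paper's code)."""

    def __init__(self, v_dim=2048, q_dim=1024, latent_dim=1024):
        super().__init__()
        # LEF: project both modalities into a common latent space.
        self.v_proj = nn.Linear(v_dim, latent_dim)
        self.q_proj = nn.Linear(q_dim, latent_dim)
        # DEF: map each modality into the reciprocal modality's space.
        self.q_to_v = nn.Linear(q_dim, v_dim)
        self.v_to_q = nn.Linear(v_dim, q_dim)
        # Bring the two DEF products back to the latent dimension.
        self.v_space_out = nn.Linear(v_dim, latent_dim)
        self.q_space_out = nn.Linear(q_dim, latent_dim)

    def forward(self, v, q):
        # LEF branch: element-wise product in the shared latent space.
        lef = torch.tanh(self.v_proj(v)) * torch.tanh(self.q_proj(q))
        # DEF branch: fuse inside each original embedding space,
        # multiplying by the reciprocal transform of the other modality.
        def_v = torch.tanh(v) * torch.tanh(self.q_to_v(q))  # visual space
        def_q = torch.tanh(q) * torch.tanh(self.v_to_q(v))  # textual space
        # HEF: combine LEF and DEF features (summation assumed here).
        return lef + self.v_space_out(def_v) + self.q_space_out(def_q)

# Usage with hypothetical pooled image and question encodings:
fusion = HybridEmbeddingFusion()
v = torch.randn(32, 2048)  # e.g. CNN image features
q = torch.randn(32, 1024)  # e.g. RNN question encoding
fused = fusion(v, q)       # (32, 1024) feature for answer prediction
```

Under the multi-stage structure (MFS), several such HEF modules would be chained, with each stage producing its own answer prediction and the per-stage predictions aggregated at inference, per the abstract's description.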