DRDF: Determining the Importance of Different Multimodal Information with Dual-Router Dynamic Framework

2021 
In multimodal tasks, the relative importance of text and image modal information often varies from one input case to another. To model this difference in importance, we propose a high-performance and highly general Dual-Router Dynamic Framework (DRDF), consisting of a Dual-Router, an MWF-Layer, experts, and an expert fusion unit. The text router and image router in the Dual-Router take text and image modal information as input, respectively, and the MWF-Layer is responsible for determining the importance of each modality. Based on this determination, the MWF-Layer generates fused weights for the subsequent expert fusion. Experts can adopt a variety of backbones that match the current multimodal or unimodal task. DRDF features high generality and modularity. We test 12 backbones, such as Visual BERT, and their corresponding DRDF instances on the multimodal dataset Hateful Memes and on the unimodal datasets CIFAR10, CIFAR100, and TinyImagenet. Our DRDF instances outperform those backbones. We also validate the effectiveness of the components of DRDF through ablation studies, and discuss the rationale behind the design of DRDF.
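The data flow described above (two routers, an MWF-Layer that weighs the modalities, and a weighted fusion of expert outputs) can be illustrated with a minimal sketch. The abstract does not specify the internals of any component, so everything below is an assumption made for illustration: the routers are plain linear gates followed by a softmax, the MWF-Layer is a hypothetical linear scorer over the concatenated features, and the names `drdf_forward`, `text_router_w`, `image_router_w`, and `mwf_w` are invented here and are not taken from the paper.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, x):
    # Matrix-vector product; weights is a list of rows.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def drdf_forward(text_feat, image_feat, text_router_w, image_router_w, mwf_w, experts):
    # Each router scores the experts using only its own modality (assumed form).
    text_gate = softmax(linear(text_router_w, text_feat))
    image_gate = softmax(linear(image_router_w, image_feat))
    # Hypothetical MWF-Layer: two importance scores from the concatenated features.
    modal_importance = softmax(linear(mwf_w, text_feat + image_feat))
    a_text, a_image = modal_importance
    # Fused expert weights: importance-weighted mix of the two router outputs.
    fused = [a_text * t + a_image * i for t, i in zip(text_gate, image_gate)]
    # Expert fusion unit (assumed): weighted sum of the expert outputs.
    outputs = [expert(text_feat, image_feat) for expert in experts]
    dim = len(outputs[0])
    result = [sum(w * out[d] for w, out in zip(fused, outputs)) for d in range(dim)]
    return result, fused, modal_importance

# Toy usage with two stand-in "experts" that return fixed vectors.
experts = [lambda t, i: [1.0, 0.0], lambda t, i: [0.0, 1.0]]
text_feat, image_feat = [1.0, 0.0], [0.0, 1.0]
text_router_w = [[2.0, 0.0], [0.0, 1.0]]
image_router_w = [[0.5, 0.0], [0.0, 1.5]]
mwf_w = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.5]]
result, fused, modal_importance = drdf_forward(
    text_feat, image_feat, text_router_w, image_router_w, mwf_w, experts
)
```

Because the fused weights in this sketch are a convex combination of two softmax outputs, they sum to one, so the final output stays on the same scale as the individual expert outputs. In a real DRDF instance the experts would be backbones suited to the task rather than fixed vectors.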