Relation-Aware Multi-hop Reasoning for Visual Dialog

2021 
Visual dialog is a multi-modal task that requires a dialog agent to answer a series of progressive questions grounded in an image. In this paper, we propose the Relation-aware Multi-hop Reasoning Network (R2N) for visual dialog, which performs multi-hop reasoning during the visual co-reference resolution process in a recurrent way. At each hop, to fully understand the visual scene, a Relation-aware Graph Attention Network encodes the image as a graph with multi-type inter-object relations via a graph attention mechanism. Moreover, we find that an auxiliary clustering mechanism on answer candidates is conducive to the model's performance. Experimental results on the VisDial v1.0 dataset demonstrate that the proposed model is effective and outperforms the compared models.
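To make the relation-aware attention idea concrete, the following is a minimal NumPy sketch of one reasoning hop: object features attend to each other, with the attention logits conditioned on a learned embedding of the relation type between each object pair. All names, shapes, and the exact scoring function here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def relation_aware_hop(node_feats, rel_ids, rel_emb, W, a):
    """One hop of relation-aware graph attention (hypothetical sketch).

    node_feats: (N, d) object features from the image
    rel_ids:    (N, N) integer relation type between each object pair
    rel_emb:    (R, d) learned embeddings, one per relation type
    W:          (d, d) shared projection matrix
    a:          (3d,)  attention scoring vector
    """
    N, d = node_feats.shape
    h = node_feats @ W                      # project object features
    r = rel_emb[rel_ids]                    # (N, N, d) relation features
    src = np.repeat(h[:, None, :], N, axis=1)   # source object, broadcast
    tgt = np.repeat(h[None, :, :], N, axis=0)   # target object, broadcast
    # logits combine source, target, and relation-type features
    logits = np.tanh(np.concatenate([src, tgt, r], axis=-1)) @ a  # (N, N)
    alpha = softmax(logits, axis=-1)        # attention over neighbors
    return alpha @ h                        # relation-aware node updates

# Multi-hop reasoning: apply the same layer recurrently.
def multi_hop(node_feats, rel_ids, rel_emb, W, a, hops=2):
    h = node_feats
    for _ in range(hops):
        h = relation_aware_hop(h, rel_ids, rel_emb, W, a)
    return h
```

In this sketch the relation embedding enters only through the attention logits; a fuller variant might also inject it into the aggregated messages themselves.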