Graph Convolutional Network for Visual Question Answering Based on Fine-grained Question Representation.

2020 
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Recent work has mainly focused on bilinear fusion methods and attention mechanisms to explore global and local features, respectively. When an image contains multiple objects, fine-grained question features can effectively identify the relationships between those objects. We therefore propose a graph convolutional network based on fine-grained question representation (FQ-GCN). An object relation graph is first constructed; the fine-grained question features are then used to explore the relations between objects in the image and to prune unrelated edges between object nodes. A graph convolutional network aggregates the neighborhood information of each object in the object graph. Experiments on the VQA 2.0 dataset show that our FQ-GCN model improves performance by 2%$\sim$3% over classical methods.
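The following is a minimal, hypothetical sketch of the idea described in the abstract, not the authors' implementation: pairwise edge scores between detected objects are conditioned on a question vector, low-scoring (question-irrelevant) edges are pruned, and a GCN-style layer aggregates neighborhood information over the remaining graph. All names (`QuestionGuidedGCNLayer`, `prune_threshold`) and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedGCNLayer(nn.Module):
    """Sketch of a question-guided graph convolution over object features.

    Assumption-based re-creation of the abstract's description: edge scores
    between object pairs are conditioned on the question representation,
    weak edges are pruned, and the remaining neighborhood is aggregated.
    """

    def __init__(self, obj_dim, q_dim, hidden_dim, prune_threshold=0.1):
        super().__init__()
        # Scores the relevance of an (object_i, object_j) pair given the question
        self.edge_scorer = nn.Linear(2 * obj_dim + q_dim, 1)
        # Transforms node features before aggregation
        self.proj = nn.Linear(obj_dim, hidden_dim)
        self.prune_threshold = prune_threshold

    def forward(self, obj_feats, q_feat):
        # obj_feats: (N, obj_dim) object region features; q_feat: (q_dim,) question vector
        n = obj_feats.size(0)
        oi = obj_feats.unsqueeze(1).expand(n, n, -1)
        oj = obj_feats.unsqueeze(0).expand(n, n, -1)
        q = q_feat.view(1, 1, -1).expand(n, n, -1)
        scores = torch.sigmoid(self.edge_scorer(torch.cat([oi, oj, q], dim=-1))).squeeze(-1)
        # Prune edges the question deems irrelevant; keep self-loops
        adj = torch.where(scores >= self.prune_threshold, scores, torch.zeros_like(scores))
        adj = adj + torch.eye(n, device=obj_feats.device)
        # Row-normalize, then aggregate each object's neighborhood information
        adj = adj / adj.sum(dim=-1, keepdim=True)
        return F.relu(adj @ self.proj(obj_feats))


# Example usage: 36 detected objects with 2048-d features, a 1024-d question vector
layer = QuestionGuidedGCNLayer(obj_dim=2048, q_dim=1024, hidden_dim=512)
out = layer(torch.randn(36, 2048), torch.randn(1024))
print(out.shape)  # torch.Size([36, 512])
```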