Fast Graph Convolution Network Based Multi-label Image Recognition via Cross-modal Fusion

2020 
In multi-label image recognition, it has become a popular method to predict those labels that co-occur in an image via modeling the label dependencies. Previous works focus on capturing the correlation between labels, but neglect to effectively fuse the image features and label embeddings, which severely affects the convergence efficiency of the model and inhibits the further precision improvement of multi-label image recognition. To overcome this shortcoming, in this paper, we introduce Multi-modal Factorized Bilinear pooling (MFB) which works as an efficient component to fuse cross-modal embeddings and propose F-GCN, a fast graph convolution network (GCN) based multi-label image recognition model. F-GCN consists of three key modules: (1) an image representation learning module which adopts a convolution neural network (CNN) to learn and generate image representations, (2) a label co-occurrence embedding module which first obtains the label vectors via the word embeddings technique and then adopts GCN to capture label co-occurrence embeddings and (3) an MFB fusion module which efficiently fuses these cross-modal vectors to enable an end-to-end model with a multi-label loss function. We conduct extensive experiments on two multi-label datasets including MS-COCO and VOC2007. Experimental results demonstrate the MFB component efficiently fuses image representations and label co-occurrence embeddings and thus greatly improves the convergence efficiency of the model. In addition, the performance of image recognition has also been promoted compared with the state-of-the-art methods.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    18
    References
    4
    Citations
    NaN
    KQI
    []