Hierarchical Feature Aggregation Based on Transformer for Image-Text Matching

2022 
To perform more accurate retrieval across image-text modalities, some scholars use fine-grained features to align images and text. Most of them directly apply an attention mechanism to align image regions with words in the sentence, ignoring the fact that the semantics of an object are abstract and cannot be accurately expressed by object information alone. To overcome this weakness, we propose a hierarchical feature aggregation algorithm based on graph convolutional networks (GCN) that preserves object semantic integrity by hierarchically integrating the attributes of an object and the relations between objects in both the image and text modalities. To eliminate the semantic gap between modalities, we propose a transformer-based cross-modal feature fusion method that generates modality-specific feature representations by integrating both the object features and the global feature from the other modality. We then map the fused features into a common space. Experimental results on the widely used MSCOCO and Flickr30K datasets show the effectiveness of the proposed model compared with state-of-the-art methods.
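
The abstract describes two components, GCN-based hierarchical aggregation of object features and transformer-based cross-modal fusion, without implementation details. The PyTorch sketch below is a minimal illustration of both ideas under our own assumptions: the single graph-convolution step, the module names (HierarchicalAggregation, CrossModalFusion), the shared fusion module, and all dimensions are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class HierarchicalAggregation(nn.Module):
    """Hypothetical GCN-style step: enrich each object feature with its
    attributes, then propagate information along the relation graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.attr_proj = nn.Linear(dim, dim)   # attribute integration
        self.rel_proj = nn.Linear(dim, dim)    # relation (graph) integration

    def forward(self, obj, adj):
        # obj: (B, N, D) object features; adj: (B, N, N) relation adjacency
        obj = obj + torch.relu(self.attr_proj(obj))                  # attribute cues
        obj = obj + torch.relu(torch.bmm(adj, self.rel_proj(obj)))   # one GCN hop
        return obj

class CrossModalFusion(nn.Module):
    """Hypothetical transformer-based fusion: features of one modality attend
    to the other modality's object features plus its global feature."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.proj = nn.Linear(dim, dim)  # map fused features into the common space

    def forward(self, own, other_obj, other_global):
        # Append the other modality's global feature as one extra context token.
        ctx = torch.cat([other_obj, other_global.unsqueeze(1)], dim=1)
        fused, _ = self.attn(own, ctx, ctx)        # cross-modal attention
        return self.proj(self.norm(own + fused))   # residual, then common space

# Toy usage: 4 image regions and 6 words with 256-d features per batch item.
B, D = 2, 256
img, txt = torch.randn(B, 4, D), torch.randn(B, 6, D)
adj_img = torch.softmax(torch.randn(B, 4, 4), dim=-1)  # stand-in relation graphs
adj_txt = torch.softmax(torch.randn(B, 6, 6), dim=-1)
g_img, g_txt = img.mean(1), txt.mean(1)                # stand-in global features

agg_v, agg_t, fuse = HierarchicalAggregation(D), HierarchicalAggregation(D), CrossModalFusion(D)
img_f = fuse(agg_v(img, adj_img), agg_t(txt, adj_txt), g_txt)
txt_f = fuse(agg_t(txt, adj_txt), agg_v(img, adj_img), g_img)
sim = torch.cosine_similarity(img_f.mean(1), txt_f.mean(1))  # matching score
print(sim.shape)  # torch.Size([2])
```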