Exploring Entity-Level Spatial Relationships for Image-Text Matching

2020 
Exploring entity-level (i.e., objects in an image, words in a text) spatial relationships contributes to understanding multimedia content precisely. Ignoring spatial information, as previous works do, can lead to misinterpretations of image content. For instance, the sentences ‘Boats are on the water’ and ‘Boats are under the water’ describe the same objects but correspond to different scenes. To this end, we utilize the relative positions of objects to capture entity-level spatial relationships for image-text matching. Specifically, we fuse the semantic and spatial relationships of image objects in a visual intra-modal relation module. This module helps the model understand image content, improves object representation learning, and contributes to capturing the entity-level latent correspondence of image-text pairs. The query text then serves as textual context to refine the interpretable alignments of image-text pairs in the inter-modal relation module. Our method achieves state-of-the-art results on the MSCOCO and Flickr30K datasets.
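Below is a minimal sketch of the idea behind a visual intra-modal relation module that fuses semantic and relative-spatial relationships, assuming Faster R-CNN-style region features and bounding boxes. The function and class names (`relative_position_features`, `IntraModalRelation`), the log-scaled geometry encoding, and the learned spatial bias are our illustrative assumptions, not the authors' released code:

```python
# Sketch: fusing semantic affinity with relative box geometry when
# refining region features (an assumed reading of the intra-modal module).
import torch
import torch.nn as nn
import torch.nn.functional as F

def relative_position_features(boxes):
    """Pairwise relative-position features for N boxes.

    boxes: (N, 4) tensor of [x1, y1, x2, y2] coordinates.
    Returns (N, N, 4): log-scaled center offsets and size ratios,
    a common encoding of entity-level spatial relationships.
    """
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1e-3)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1e-3)
    dx = torch.log(((cx[None, :] - cx[:, None]).abs() / w[:, None]).clamp(min=1e-3))
    dy = torch.log(((cy[None, :] - cy[:, None]).abs() / h[:, None]).clamp(min=1e-3))
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)

class IntraModalRelation(nn.Module):
    """Attention over regions whose weights combine semantic similarity
    with a learned bias computed from relative box geometry."""

    def __init__(self, dim=1024):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Small MLP that maps 4-d geometry to a scalar attention bias.
        self.spatial = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, feats, boxes):
        # feats: (N, dim) region features; boxes: (N, 4) box coordinates.
        sem = self.q(feats) @ self.k(feats).t() / feats.size(-1) ** 0.5
        geo = self.spatial(relative_position_features(boxes)).squeeze(-1)
        attn = F.softmax(sem + geo, dim=-1)  # semantic + spatial fusion
        return feats + attn @ self.v(feats)  # residual-refined region features

# Example: 36 detected regions with 1024-d features.
feats = torch.randn(36, 1024)
boxes = torch.rand(36, 4) * 100
boxes[:, 2:] += boxes[:, :2]  # ensure x2 > x1 and y2 > y1
refined = IntraModalRelation()(feats, boxes)
print(refined.shape)  # torch.Size([36, 1024])
```

Adding the geometric bias to the semantic affinity before the softmax lets the attention distinguish spatial configurations, such as ‘on the water’ versus ‘under the water’, that purely appearance-based similarity would conflate.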