Aligned attention for common multimodal embeddings

2020 
Incredible advances in computer vision, natural language processing, and general pattern understanding have been attributed to deep learning. Recent discoveries have enabled efficient vector representations of both visual and written stimuli. Robustly transferring between the two modalities remains a challenge that could yield benefits for search, retrieval, and storage applications. We introduce a simple yet highly effective approach for building a connection space where natural language sentences are tightly coupled with visual data. In this connection space, similar concepts lie close, whereas dissimilar concepts lie far apart, irrespective of their modality. We introduce an attention mechanism to align multimodal embeddings that are learned through a multimodal metric loss function. We evaluate the learned common vector space on multiple image–text datasets: Pascal Sentences, NUS-WIDE-10k, XMediaNet, Flowers, and Caltech-UCSD Birds. We extend our method to five modalities (image, sentence, audio, video, and 3D models) to demonstrate cross-modal retrieval on the XMedia dataset. We obtain state-of-the-art retrieval and zero-shot retrieval results across all datasets.
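The abstract names two ingredients: an attention mechanism that aligns embeddings across modalities, and a multimodal metric loss that places matching image and sentence embeddings close together in the shared space. The sketch below illustrates these ideas only in generic form; the module `AttentionPool`, the function `multimodal_margin_loss`, the embedding dimension, and the margin value are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): attention-pooled sentence embeddings
# plus a hinge-based cross-modal metric loss over a shared embedding space.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionPool(nn.Module):
    """Pool word embeddings into one sentence vector, attending with respect to the image."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, words, image):
        # words: (B, T, D) word embeddings; image: (B, D) image embedding
        scores = torch.einsum('btd,bd->bt', self.proj(words), image)  # (B, T) alignment scores
        weights = F.softmax(scores, dim=1)                            # attention over words
        return torch.einsum('bt,btd->bd', weights, words)             # (B, D) sentence vector


def multimodal_margin_loss(img, txt, margin=0.2):
    """Hinge loss: matched image-text pairs must be closer than mismatched in-batch pairs."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    sim = img @ txt.t()                        # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)              # similarity of each matching pair
    cost_i2t = (margin + sim - pos).clamp(min=0)      # image-to-text violations
    cost_t2i = (margin + sim - pos.t()).clamp(min=0)  # text-to-image violations
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0)
    cost_t2i = cost_t2i.masked_fill(mask, 0)
    return cost_i2t.mean() + cost_t2i.mean()


# Toy usage with random features standing in for image- and word-encoder outputs.
B, T, D = 8, 12, 256
words = torch.randn(B, T, D)
image = torch.randn(B, D)
sentence = AttentionPool(D)(words, image)
loss = multimodal_margin_loss(image, sentence)
print(loss.item())
```

A bidirectional hinge over in-batch negatives is one common way to realize such a metric loss; the paper's exact loss formulation and attention design may differ.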