Jointly Learning of Visual and Auditory: A New Approach for RS Image and Audio Cross-Modal Retrieval

2019 
Remote sensing (RS) images are widely used in both civilian and military fields. With the rapid growth of image data, fast and efficient RS image retrieval has become a challenging problem. However, existing image retrieval methods, whether text-based or content-based, remain limited in practical applications; for example, text input is inefficient, and a sample image for the query is often unavailable. Speech, by contrast, is a natural and convenient means of communication. Therefore, a novel speech-image cross-modal retrieval approach, named deep visual-audio network (DVAN), is presented in this article, which establishes a direct relationship between images and speech from paired image-audio data. The model consists of three main parts: 1) image feature extraction, which extracts effective features from RS images; 2) audio feature learning, which recognizes key information in raw audio data, where AudioNet, a component of DVAN, is proposed to obtain more discriminative features; and 3) multimodal embedding, which learns the direct correlations between the two modalities. Experimental results on an RS image-audio dataset demonstrate that the proposed method is effective and that speech-image retrieval is feasible, offering a new way toward faster and more convenient RS image retrieval.
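As a rough illustration of the three-part design described in the abstract, the sketch below shows a dual-branch embedding model: an image encoder, an AudioNet-style audio encoder, and a shared embedding space trained with a ranking loss. This is a minimal sketch under assumptions, not the authors' implementation: the ResNet-18 backbone, the 1D-CNN audio branch, log-mel spectrogram inputs, embedding dimension, and the bidirectional triplet ranking loss are all illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class AudioEncoder(nn.Module):
    """Hypothetical stand-in for the paper's AudioNet: a small 1D CNN
    over log-mel spectrogram frames (the real architecture is not given here)."""
    def __init__(self, n_mels=40, embed_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(128, 256, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(256, embed_dim)

    def forward(self, spec):                 # spec: (B, n_mels, T)
        h = self.conv(spec).squeeze(-1)      # global-pooled audio features
        return F.normalize(self.fc(h), dim=-1)


class ImageEncoder(nn.Module):
    """Image branch: a CNN backbone projected into the shared embedding
    space (the backbone choice is an assumption)."""
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d pooled features
        self.backbone = backbone
        self.fc = nn.Linear(512, embed_dim)

    def forward(self, img):                  # img: (B, 3, H, W)
        return F.normalize(self.fc(self.backbone(img)), dim=-1)


def triplet_ranking_loss(img_emb, aud_emb, margin=0.2):
    """Bidirectional hinge ranking loss over in-batch negatives, a common
    choice for cross-modal embedding (assumed, not quoted from the paper)."""
    sim = img_emb @ aud_emb.t()                      # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                    # matched pairs on the diagonal
    cost_a = (margin + sim - pos).clamp(min=0)       # image -> audio direction
    cost_i = (margin + sim - pos.t()).clamp(min=0)   # audio -> image direction
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_a.masked_fill(mask, 0).mean() + cost_i.masked_fill(mask, 0).mean()


if __name__ == "__main__":
    imgs = torch.randn(8, 3, 224, 224)               # dummy RS image batch
    specs = torch.randn(8, 40, 200)                  # dummy log-mel spectrograms
    loss = triplet_ranking_loss(ImageEncoder()(imgs), AudioEncoder()(specs))
    print(loss.item())
```

At retrieval time, both branches would embed queries and database items into the same normalized space, so a spoken query can be matched to RS images by cosine similarity alone.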