Multimodal Reconstruction Using Vector Representation

2018 
Recent work has demonstrated that neural embeddings from multiple modalities can be used to steer the outputs of generative adversarial networks. However, little work has been done on developing a procedure for combining vectors from different modalities for the purpose of reconstructing the input. Typically, embeddings from different modalities are simply concatenated into a larger input vector. In this paper, we propose learning a Common Vector Space (CVS) in which similar inputs from different modalities cluster together. We develop a framework to analyze the degree of reconstruction and robustness offered by the CVS. We apply the CVS to annotating, generating, and captioning images on MS-COCO. We show that the CVS is on par with existing techniques for multimodal embeddings while offering greater flexibility as the number of modalities increases.
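The abstract contrasts concatenating modality embeddings with learning a shared space where matching inputs from different modalities land close together. The paper does not specify its training objective here, but the idea can be sketched with a contrastive (InfoNCE-style) alignment loss between, say, image and caption features; all names (`W_img`, `W_txt`, the feature dimensions, the loss choice) are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def project(x, W):
    """Map modality-specific features into the common space, L2-normalized."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_loss(za, zb, temperature=0.1):
    """InfoNCE-style loss: paired embeddings (the diagonal) should be
    more similar to each other than to any other item in the batch."""
    logits = za @ zb.T / temperature              # (N, N) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # matching pairs on the diagonal

rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))            # e.g. CNN image features (assumed dim)
txt = rng.normal(size=(8, 300))            # e.g. averaged word embeddings (assumed dim)
W_img = rng.normal(size=(512, 64)) * 0.01  # hypothetical learned projections
W_txt = rng.normal(size=(300, 64)) * 0.01
loss = contrastive_loss(project(img, W_img), project(txt, W_txt))
```

Minimizing such a loss over many (image, caption) pairs drives matching inputs from both modalities toward the same region of the common space, which is the clustering property the abstract describes; adding a third modality only requires one more projection into the same space, rather than growing a concatenated vector.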