Coupled-dynamic Learning for Vision and Language: Exploring Interaction between Different Tasks

2021 
Abstract
Intensive research interest has been devoted to the vision-and-language community. In particular, image captioning aims to generate natural language descriptions from image content, while, conversely, image synthesis aims to generate realistic images from natural language descriptions. Both tasks achieve promising results with Long Short-Term Memory (LSTM) networks, which model sequence dynamics as a hidden state at each time step. Nevertheless, research on these dynamics is usually confined to a single task, and the mutual relationship between the dynamics of different tasks remains unexplored. In this work, we present a novel coupled-dynamics formulation that iteratively reduces the distance between task-dependent dynamics during training. To inject information from the dual task into each individual network, we construct dual-loss architectures that interactively align the dynamics. We evaluate the proposed framework on the Flickr8k, Flickr30k, and MSCOCO datasets. Experimental results show that our approach boosts both tasks jointly and achieves competitive performance against state-of-the-art methods.
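The abstract gives no implementation details, so the following is only a minimal PyTorch-style sketch of the core idea it describes: each task keeps its own objective, plus a shared alignment term that pulls the two LSTMs' hidden-state dynamics together. The names `CoupledDynamicsLoss` and `align_weight`, and the choice of an L2 distance between hidden states, are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn as nn


class CoupledDynamicsLoss(nn.Module):
    """Illustrative dual-loss: each task's own loss plus an alignment
    term measuring the distance between the two LSTMs' dynamics."""

    def __init__(self, align_weight: float = 0.1):
        super().__init__()
        # Hypothetical trade-off weight between task loss and alignment.
        self.align_weight = align_weight

    def forward(self, caption_loss, synthesis_loss,
                caption_hidden, synthesis_hidden):
        # caption_hidden / synthesis_hidden: (batch, time, hidden) tensors
        # of LSTM hidden states from the two task networks, assumed here
        # to share the same hidden size and sequence length.
        align = torch.mean((caption_hidden - synthesis_hidden) ** 2)
        loss_caption = caption_loss + self.align_weight * align
        loss_synthesis = synthesis_loss + self.align_weight * align
        return loss_caption, loss_synthesis
```

Under this reading, training would back-propagate `loss_caption` through the captioning network and `loss_synthesis` through the synthesis network, so each task sees the other's dynamics through the shared alignment term, iteratively reducing the distance between them as the abstract describes.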