Parallel-fusion LSTM with synchronous semantic and visual information for image captioning

2021 
Abstract For synchronously combining the dynamic semantic and visual information in the decoder part of image captioning, we propose a novel parallel-fusion LSTM (pLSTM) structure in this paper. Two parallel LSTMs with attributes and visual information of image are fused by the hidden states at every time step, which makes the attributes and visual information complementary or enhanced for generating more accurate captions. According to the different ways of integrating semantic information from attribute LSTM to visual LSTM, we propose two models pLSTM with attention (pLSTM-A) and pLSTM with guiding (pLSTM-G). pLSTM-A can automatically capture the crucial semantic and visual information to generate captions, and pLSTM-G directly adjusts the hidden state of visual LSTM by synchronous semantic information to the critical region. For verifying the effectiveness of our proposed pLSTM, we conduct a series of experiments on MSCOCO and Flickr30K datasets, and the experimental results outperform some state-of-the-art image captioning methods.
    • Correction
    • Source
    • Cite
    • Save
    • Machine Reading By IdeaReader
    42
    References
    1
    Citations
    NaN
    KQI
    []