A Neural Image Captioning Model with Caption-to-Images Semantic Constructor

2019 
Abstract

Current dominant image captioning models are mostly based on a CNN-LSTM encoder-decoder framework. Although this architecture has achieved remarkable progress, it still falls short of fully exploiting the encoded image information: the model uses only the image-to-caption dependency during caption generation. In this paper, we extend the conventional CNN-LSTM image captioning model with a caption-to-images semantic reconstructor, which reconstructs the semantic representations of the input image and its similar images from the hidden states of the decoder. Serving as an auxiliary objective that evaluates the fidelity of the generated caption, the reconstruction score of the semantic reconstructor is combined with the likelihood to refine model training. In this way, the semantics of the input image are more effectively transferred to the decoder and fully exploited to generate better captions. Moreover, during testing, the reconstruction score can be used together with the log-likelihood to select better captions via reranking. Experimental results show that the proposed model significantly improves the quality of the generated captions and outperforms a conventional image captioning model, LSTM-A5.
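The abstract does not give the reconstructor's architecture or the exact loss, so the following is only a minimal sketch of the idea: an auxiliary module that maps pooled decoder hidden states back to image semantic vectors, a training objective combining negative log-likelihood with the reconstruction error, and a test-time reranking score. PyTorch is assumed, and every name, the mean-pooled linear reconstructor, the MSE reconstruction error, and the weights `alpha` and `beta` are hypothetical stand-ins, not the paper's method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticReconstructor(nn.Module):
    """Reconstructs image semantic vectors from decoder hidden states.

    Hypothetical module: the abstract does not specify the architecture,
    so a mean-pooled linear projection stands in here.
    """

    def __init__(self, hidden_dim: int, semantic_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, semantic_dim)

    def forward(self, decoder_states: torch.Tensor) -> torch.Tensor:
        # decoder_states: (batch, seq_len, hidden_dim)
        pooled = decoder_states.mean(dim=1)   # (batch, hidden_dim)
        return self.proj(pooled)              # (batch, semantic_dim)


def training_loss(log_likelihood, decoder_states, image_semantics,
                  similar_semantics, reconstructor, alpha=1.0):
    """Combined objective: negative log-likelihood + reconstruction error.

    image_semantics:   (batch, semantic_dim) semantics of the input image
    similar_semantics: (batch, k, semantic_dim) semantics of k similar images
    alpha: weight on the reconstruction term (hyperparameter; value assumed)
    """
    recon = reconstructor(decoder_states)     # (batch, semantic_dim)
    # Reconstruction error against the input image ...
    err_input = F.mse_loss(recon, image_semantics)
    # ... and against the similar images, averaged over all k of them.
    err_similar = F.mse_loss(
        recon.unsqueeze(1).expand_as(similar_semantics), similar_semantics)
    return -log_likelihood + alpha * (err_input + err_similar)


def rerank_score(log_likelihood: float, recon_error: float,
                 beta: float = 1.0) -> float:
    # At test time, beam-search candidates can be reranked by combining the
    # log-likelihood with the (negated) reconstruction error; beta is assumed.
    return log_likelihood - beta * recon_error
```

Under these assumptions, a lower reconstruction error indicates that the generated caption preserves more of the image semantics, so the reranking score prefers candidates that are both likely under the decoder and faithful to the input image.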