A Visual Persistence Model for Image Captioning
2021
Abstract: Object-level features from Faster R-CNN and attention mechanisms have been used extensively in image captioning based on encoder-decoder frameworks. However, most existing methods feed the average pooling of object features to the captioning model as the global image representation and recompute the attention weights over object regions for every generated word, without modeling the visual persistence that humans exhibit. In this paper, we build Visual Persistence modules in both the encoder and the decoder: the module in the encoder selects core object features to replace the global image representation, while the module in the decoder evaluates the correlation between the previous and current attention results and fuses them into the final attended feature used to generate the next word. Experimental results on MSCOCO validate the effectiveness and competitiveness of our Visual Persistence Model (VPNet). Remarkably, VPNet also achieves competitive scores on most metrics of the MSCOCO online test server compared with existing state-of-the-art methods.
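The decoder-side mechanism described above can be sketched as follows. This is a minimal, framework-free illustration, not the paper's implementation: it assumes dot-product attention over object features and uses cosine similarity as a stand-in for the paper's (unspecified) correlation measure between the previous and current attention results.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, feats):
    """Standard dot-product attention: weight object features by the query."""
    weights = softmax([dot(query, f) for f in feats])
    dim = len(feats[0])
    return [sum(w * f[d] for w, f in zip(weights, feats)) for d in range(dim)]

def cosine(a, b):
    na = math.sqrt(dot(a, a))
    nb = math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def persistent_attend(query, feats, prev_attended):
    """Fuse the current attention result with the previous one,
    gated by their correlation (cosine similarity here is an
    illustrative assumption, not the paper's learned gate)."""
    cur = attend(query, feats)
    if prev_attended is None:
        return cur  # first decoding step: nothing to persist
    g = max(cosine(cur, prev_attended), 0.0)  # correlation gate in [0, 1]
    return [(1 - g) * c + g * p for c, p in zip(cur, prev_attended)]
```

At each decoding step the fused vector, rather than the freshly recomputed attention alone, would serve as the attended feature for predicting the next word, so attention from earlier steps persists in proportion to how correlated it is with the current result.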