Reasoning like Humans: On Dynamic Attention Prior in Image Captioning

2021 
Abstract: Attention-based models have been widely used in image captioning. Nevertheless, most conventional deep attention models perform attention operations at each block/step independently, neglecting the prior knowledge obtained in previous steps. In this paper, we propose a novel method, DYnamic Attention PRior (DY-APR), which combines attention distribution priors with local linguistic context for caption generation. Like a human, DY-APR can gradually shift its attention from a multitude of objects to the ones of keen interest when coping with an image of a complex scene: it first captures rough information and then explicitly updates the attention weights step by step. Moreover, DY-APR fully leverages local linguistic context from previously generated tokens, that is, it capitalizes on local information while performing global attention, a scheme we refer to as "local-global attention". We show that the prior knowledge from previous steps provides meaningful semantic information, serving as guidance for building more accurate attention in later layers. Experiments on the MS-COCO dataset demonstrate the effectiveness of DY-APR, which improves CIDEr-D by 2.32% with less than 0.2% additional FLOPs and parameters.
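To make the core idea concrete, the sketch below illustrates one plausible way to fold an attention prior from earlier steps into the current attention distribution. It is not the authors' implementation; the function name, the fixed interpolation weight `mix`, and the simple convex combination are all assumptions for illustration only (the paper may learn the mixing or apply it differently).

```python
# Hypothetical sketch of scaled dot-product attention with a dynamic prior.
# Idea: blend the current attention distribution with the distribution carried
# over from previous layers/steps, so later steps refine earlier focus.
import torch
import torch.nn.functional as F

def attention_with_prior(q, k, v, prior=None, mix=0.5):
    """q: (B, T, d); k, v: (B, S, d); prior: (B, T, S) attention from earlier steps.
    `mix` is an assumed interpolation weight, not a value from the paper."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5      # (B, T, S)
    attn = F.softmax(scores, dim=-1)
    if prior is not None:
        attn = mix * attn + (1.0 - mix) * prior      # fold in prior knowledge
        attn = attn / attn.sum(dim=-1, keepdim=True) # renormalize
    out = attn @ v                                   # (B, T, d)
    return out, attn                                 # attn becomes the next prior

# Usage: carry the returned attention forward as the prior for the next step.
B, T, S, d = 2, 5, 49, 64
q, k, v = torch.randn(B, T, d), torch.randn(B, S, d), torch.randn(B, S, d)
out1, prior = attention_with_prior(q, k, v)
out2, _ = attention_with_prior(q, k, v, prior=prior)
```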