Sequential Transformer via an Outside-In Attention for image captioning

2022 
Attention-based approaches have firmly established the state of the art in image captioning. However, both the recurrent attention in recurrent neural networks (RNNs) and the self-attention in Transformers have limitations. Recurrent attention relies only on the external state to decide where to look, ignoring the internal relationships between image regions; self-attention is just the opposite. To fill this gap, we first introduce an Outside-in Attention that lets the external state participate in the interaction among image regions, prompting the model to learn both the dependencies inside the image regions and the dependencies between the image regions and the external state. We then investigate a Sequential Transformer framework (S-Transformer) based on the original Transformer architecture, whose decoder incorporates the Outside-in Attention and an RNN. This framework helps the model inherit the advantages of both the Transformer and the recurrent network in sequence modeling. When tested on the COCO dataset, the proposed approaches achieve competitive results in single-model and ensemble configurations on both the MSCOCO Karpathy test split and the online test server.
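One plausible reading of the Outside-in Attention described above is an attention layer in which the external (e.g. decoder/RNN) state is appended to the key/value set of the region self-attention, so each image region attends both to the other regions and to the external state. The sketch below illustrates only that idea in NumPy; the function name, shapes, and projection setup are assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def outside_in_attention(regions, ext_state, Wq, Wk, Wv):
    """Hypothetical sketch: the external state joins the key/value set,
    so each region attends to the other regions AND to the state.

    regions:   (n, d) image-region features
    ext_state: (d,)   external state, e.g. an RNN hidden state
    """
    kv_in = np.vstack([regions, ext_state[None, :]])  # (n + 1, d)
    Q = regions @ Wq                                  # (n, d)
    K = kv_in @ Wk                                    # (n + 1, d)
    V = kv_in @ Wv                                    # (n + 1, d)
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)        # (n, n + 1)
    return A @ V                                      # (n, d) refined regions

rng = np.random.default_rng(0)
n, d = 5, 8
regions = rng.standard_normal((n, d))
h = rng.standard_normal(d)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = outside_in_attention(regions, h, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```

In this sketch the extra key/value row is what distinguishes the layer from plain self-attention: the attention weights over the `n + 1` slots capture region-to-region dependencies and region-to-state dependencies in a single softmax.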