Improved Speaker and Navigator for Vision-and-Language Navigation

2021 
Prior work in vision-and-language navigation (VLN) relies on long short-term memory (LSTM) networks to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The capability of LSTM to process inter-modal interactions has been widely verified; however, LSTM neglects intra-modal interactions, which degrades both the navigator and the speaker. Attention-based Transformers perform well in sequence-to-sequence translation, but directly applying the Transformer architecture to VLN has not yet produced satisfactory results. In this paper, we propose novel Transformer-based multi-modal frameworks for the navigator and the speaker, respectively. In our frameworks, multi-head self-attention with a residual connection carries the information flow. In particular, our navigator framework includes a switch that prevents certain inputs from entering the information flow directly. In experiments, we verify the effectiveness of our proposed approach and show significant performance advantages over the baselines.
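As a rough illustration of the building block the abstract describes, the sketch below shows a multi-head self-attention layer with a residual connection, plus a learnable scalar gate standing in for the paper's "switch". All names and hyperparameters here (d_model, n_heads, the gate) are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch (PyTorch) of multi-head self-attention with a residual
# connection carrying the information flow. The scalar "gate" is a
# hypothetical stand-in for the switch mentioned in the abstract.
import torch
import torch.nn as nn


class ResidualSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Learnable gate on the attention branch: when it is small, the
        # attended signal is kept from entering the information flow directly.
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. instruction-token embeddings
        attn_out, _ = self.attn(x, x, x)
        # The residual connection carries the information flow past the block.
        return self.norm(x + self.gate * attn_out)


if __name__ == "__main__":
    layer = ResidualSelfAttention()
    tokens = torch.randn(2, 10, 512)
    print(layer(tokens).shape)  # torch.Size([2, 10, 512])
```

The residual path is what lets information flow through stacked layers unchanged when the attention branch contributes little, which is the role the abstract assigns to the self-attention-plus-residual design.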