Improved Speaker and Navigator for Vision-and-Language Navigation

2021 
Prior work in vision-and-language navigation (VLN) relies on long short-term memory (LSTM) networks to carry the flow of information in either the navigation model (navigator) or the instruction-generating model (speaker). The capability of LSTM to process inter-modal interactions has been widely verified; however, LSTM neglects intra-modal interactions, which degrades both the navigator and the speaker. Attention-based Transformers perform well in sequence-to-sequence translation, but directly applying the Transformer architecture to VLN has not yet produced satisfactory results. In this paper, we propose novel Transformer-based multi-modal frameworks for the navigator and the speaker, respectively. In our frameworks, multi-head self-attention with a residual connection carries the information flow. In particular, our navigator framework includes a switch that prevents certain inputs from entering the information flow directly. In experiments, we verify the effectiveness of our proposed approach and show significant performance advantages over the baselines.
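As a rough illustration of the building block the abstract describes, the sketch below shows a multi-head self-attention layer with a residual connection, plus a learnable scalar gate standing in for the paper's "switch". All names and hyperparameters here (d_model, n_heads, the gate) are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch (PyTorch) of multi-head self-attention with a residual
# connection carrying the information flow. The scalar "gate" is a
# hypothetical stand-in for the switch mentioned in the abstract.
import torch
import torch.nn as nn


class ResidualSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # Learnable gate on the attention branch: when it is small, the
        # attended signal is kept from entering the information flow directly.
        self.gate = nn.Parameter(torch.ones(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model), e.g. instruction-token embeddings
        attn_out, _ = self.attn(x, x, x)
        # The residual connection carries the information flow past the block.
        return self.norm(x + self.gate * attn_out)


if __name__ == "__main__":
    layer = ResidualSelfAttention()
    tokens = torch.randn(2, 10, 512)
    print(layer(tokens).shape)  # torch.Size([2, 10, 512])
```

The residual path is what lets information flow through stacked layers unchanged when the attention branch contributes little, which is the role the abstract assigns to the self-attention-plus-residual design.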