Faster Transformer Decoding: N-gram Masked Self-Attention.

2020 
Motivated by the fact that most of the information relevant to the prediction of target tokens is drawn from the source sentence $S=s_1, \ldots, s_S$, we propose truncating the target-side window used for computing self-attention by making an $N$-gram assumption. Experiments on WMT EnDe and EnFr data sets show that the $N$-gram masked self-attention model loses very little in BLEU score for $N$ values in the range $4, \ldots, 8$, depending on the task.
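To make the mechanism concrete, below is a minimal NumPy sketch of the idea as described in the abstract: causal self-attention in which each target position attends only to itself and the preceding $N-1$ tokens. The function names, shapes, and masking value are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ngram_causal_mask(seq_len: int, n: int) -> np.ndarray:
    """Boolean mask: position i may attend only to positions j with
    i - n + 1 <= j <= i (itself plus the previous n - 1 tokens)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - n)

def ngram_masked_self_attention(q, k, v, n):
    """Scaled dot-product self-attention restricted to an n-gram window.

    q, k, v: arrays of shape (seq_len, d_model); hypothetical single-head case.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                     # (seq_len, seq_len) logits
    mask = ngram_causal_mask(q.shape[0], n)
    scores = np.where(mask, scores, -1e9)             # block positions outside the window
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax over allowed positions only
    return weights @ v

# Example: with n = 4, token t attends to tokens t-3, ..., t at most.
T, d = 10, 16
rng = np.random.default_rng(0)
q = k = v = rng.standard_normal((T, d))
out = ngram_masked_self_attention(q, k, v, n=4)
```

Because each target position looks at a fixed-size window rather than the full prefix, the per-step state needed at decoding time stays constant in the target length, which is the source of the claimed speedup.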