Query-Key Normalization for Transformers.

Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, Yuxuan Chen

2020

Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer's normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply $\ell_2$ normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT'15.
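
The mechanism described in the abstract can be sketched roughly as follows. This is a minimal single-head, pure-Python illustration under stated assumptions, not the authors' implementation: the function names `l2_normalize`, `softmax`, and `qknorm_attention` are hypothetical, and the scale `g` is passed in as a fixed float, whereas the abstract describes it as a learnable parameter.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit l2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def qknorm_attention(queries, keys, values, g):
    """Single-head attention with l2-normalized queries and keys.

    queries: list of n_q vectors of size d
    keys:    list of n_k vectors of size d
    values:  list of n_k vectors of size d_v
    g:       scalar scale (assumption: fixed here, learnable in a real model)
    """
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    d_v = len(values[0])
    outputs = []
    for q in q_hat:
        # Each score is g times a cosine similarity, so it lies in [-g, g]
        # regardless of the magnitude of the raw query and key vectors.
        scores = [g * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Usage: rescaling the queries leaves the output unchanged, because only
# the direction of each query and key enters the softmax.
q = [[1.0, 0.0], [0.0, 2.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
out = qknorm_attention(q, k, v, g=5.0)
out_scaled = qknorm_attention([[100.0 * x for x in row] for row in q], k, v, g=5.0)
```

Because the pre-softmax scores are bounded by `g`, the sharpness of the attention distribution is governed by that single scale rather than by the norms of the query and key vectors, which is one way to read the abstract's claim about avoiding arbitrary softmax saturation.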

Keywords:

Natural language processing
Language translation
Division (mathematics)
Artificial intelligence
Normalization (statistics)
Square root
Computer science
Embedding
Softmax function
Expressivity
Matrix (mathematics)
