Query-Key Normalization for Transformers

2020 
Low-resource language translation is a challenging but socially valuable NLP task. Building on recent work adapting the Transformer’s normalization to this setting, we propose QKNorm, a normalization technique that modifies the attention mechanism to make the softmax function less prone to arbitrary saturation without sacrificing expressivity. Specifically, we apply l2-normalization along the head dimension of each query and key matrix prior to multiplying them and then scale up by a learnable parameter instead of dividing by the square root of the embedding dimension. We show improvements averaging 0.928 BLEU over state-of-the-art bilingual benchmarks for 5 low-resource translation pairs from the TED Talks corpus and IWSLT’15.
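Below is a minimal sketch of the attention scoring the abstract describes, not the authors' reference implementation. The module name `QKNormAttention`, tensor shapes, and the initial value of the learnable scale `g` are illustrative assumptions; only the two highlighted steps (l2-normalizing queries and keys along the head dimension, then multiplying by a learnable scalar instead of dividing by the square root of the head dimension) come from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Multi-head attention with query-key normalization (sketch)."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)
        # Learnable scale replacing the usual 1/sqrt(d_k) factor
        # (initial value here is an assumption, not from the paper).
        self.g = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, mask: torch.Tensor | None = None) -> torch.Tensor:
        B, T, _ = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, T, embed_dim) -> (B, num_heads, T, head_dim)
            return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # QKNorm: l2-normalize queries and keys along the head dimension,
        # so each query-key dot product becomes a cosine similarity in [-1, 1].
        q = F.normalize(q, p=2, dim=-1)
        k = F.normalize(k, p=2, dim=-1)

        # Scale up by the learnable parameter g instead of dividing by sqrt(d_k),
        # keeping the softmax away from arbitrary saturation.
        scores = self.g * torch.matmul(q, k.transpose(-2, -1))
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float("-inf"))
        attn = torch.softmax(scores, dim=-1)

        out = torch.matmul(attn, v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(out)
```

Because the normalized dot products are bounded, the logit scale fed to the softmax is controlled entirely by `g`, which the model can tune during training rather than being fixed by the embedding dimension.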