Linear Transformers Are Secretly Fast Weight Memory Systems

2021 
We show the formal equivalence of linearised self-attention mechanisms and fast weight memories from the early '90s. From this observation we infer a memory capacity limitation of recent linearised softmax attention variants. Since the memory is finite, a desirable behaviour of fast weight memory models is to manipulate the contents of the memory and to interact with it dynamically. Inspired by previous work on fast weights, we propose to replace the update rule with an alternative rule that yields such behaviour. We also propose a new kernel function for linearising attention that balances simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks, which demonstrate the benefits of our methods.
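The sketch below is not the authors' code; it is a minimal NumPy illustration, under assumed choices, of the equivalence the abstract claims: causal linearised attention with a kernel feature map computes the same outputs as an outer-product fast weight memory that is written with (key, value) pairs and read with queries. The feature map phi (ELU + 1), the function names, and the `beta` parameter of the final update-rule sketch are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def phi(x):
    # Simple positive feature map (ELU + 1), one common choice for
    # linearising softmax attention; the paper proposes its own kernel.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    # Causal linearised attention: step i attends over positions j <= i.
    T = Q.shape[0]
    out = np.zeros_like(V)
    for i in range(T):
        scores = phi(K[: i + 1]) @ phi(Q[i])           # (i+1,)
        out[i] = (scores @ V[: i + 1]) / (scores.sum() + 1e-8)
    return out

def fast_weight_memory(Q, K, V):
    # The same computation, written as a fast weight matrix W that is
    # updated with additive outer products v_i phi(k_i)^T and read with phi(q_i).
    d_k, d_v = phi(K[0]).shape[0], V.shape[1]
    W = np.zeros((d_v, d_k))
    z = np.zeros(d_k)                                  # normaliser state
    out = np.zeros_like(V)
    for i in range(Q.shape[0]):
        k, q, v = phi(K[i]), phi(Q[i]), V[i]
        W += np.outer(v, k)                            # write: purely additive update
        z += k
        out[i] = (W @ q) / (z @ q + 1e-8)              # read
    return out

def corrective_step(W, k, v, beta=0.5):
    # A hedged sketch of the kind of alternative update rule the abstract
    # alludes to: first retrieve the value currently stored under k, then
    # write an interpolation towards the new value, so existing memory
    # contents can be corrected rather than only accumulated. The exact
    # rule (and how beta is obtained) is specified in the paper, not here.
    v_old = W @ k
    return W + beta * np.outer(v - v_old, k)

# Usage: the two formulations agree on random inputs.
T, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
assert np.allclose(linear_attention(Q, K, V), fast_weight_memory(Q, K, V))
```

Because the fast weight view stores everything in a single d_v x d_k matrix, only a limited number of key-value associations can be held without interference, which is the memory capacity limitation the abstract refers to.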