BERT Busters: Outlier LayerNorm Dimensions that Disrupt BERT.

2021 
Multiple studies have shown that BERT is remarkably robust to pruning, yet few if any of its components retain high importance across downstream tasks. Contrary to this received wisdom, we demonstrate that pre-trained Transformer encoders are surprisingly fragile to the removal of a very small number of scaling factors and biases in the output layer normalization (<0.0001% of model weights). These are high-magnitude normalization parameters that emerge early in pre-training and appear consistently in the same dimensional position throughout the model. They are present in all six models of the BERT family that we examined, and removing them significantly degrades both MLM perplexity and downstream task performance. Our results suggest that layer normalization plays a much more important role than usually assumed.
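The abstract describes disabling a handful of high-magnitude scaling factors and biases in the output layer normalization. A minimal sketch of how such outlier dimensions could be located and zeroed out in a pre-trained BERT with the Hugging Face `transformers` library is shown below; this is not the authors' code, and the choice of zeroing the single largest-magnitude dimension per layer is an illustrative assumption.

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")

with torch.no_grad():
    for i, layer in enumerate(model.encoder.layer):
        ln = layer.output.LayerNorm          # output layer normalization of this encoder block
        gamma, beta = ln.weight, ln.bias     # scaling factors and biases
        # Dimension with the largest-magnitude scaling factor in this layer
        # (an assumed proxy for the "outlier" dimension described in the abstract).
        dim = int(gamma.abs().argmax())
        print(f"layer {i}: outlier dim {dim}, "
              f"gamma={gamma[dim].item():.3f}, beta={beta[dim].item():.3f}")
        # "Remove" the outlier dimension by zeroing its scale and bias.
        gamma[dim] = 0.0
        beta[dim] = 0.0
```

After this modification, the degradation could be measured by re-evaluating masked language modeling perplexity or a downstream task with the edited model.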