Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks

2019 
We propose NovoGrad, a first-order stochastic gradient method with layer-wise gradient normalization via second moment estimators and with decoupled weight decay for better regularization. The method requires half as much memory as Adam/AdamW. We evaluated NovoGrad on a diverse set of problems, including image classification, speech recognition, neural machine translation, and language modeling. On these problems, NovoGrad performed equal to or better than SGD and Adam/AdamW. Empirically, we show that NovoGrad (1) is very robust during the initial training phase and does not require learning rate warm-up, (2) works well with the same learning rate policy for different problems, and (3) generally performs better than other optimizers for very large batch sizes.
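The abstract describes the update only at a high level; the sketch below is a minimal NumPy illustration of a layer-wise adaptive update with a per-layer (scalar) second moment and decoupled weight decay, matching the description above. The function name `novograd_step`, the hyperparameter values, and the state layout are assumptions for illustration, not the paper's reference implementation. Keeping the second moment as one scalar per layer rather than per parameter is what gives the roughly halved optimizer memory compared with Adam/AdamW.

```python
import numpy as np

def novograd_step(params, grads, states, lr=0.01, beta1=0.95, beta2=0.98,
                  eps=1e-8, weight_decay=0.001):
    """One hypothetical layer-wise adaptive step in the spirit of NovoGrad.

    params, grads: lists of per-layer NumPy arrays.
    states: list of dicts, each holding a scalar second moment `v` for the
    layer and a per-parameter first moment `m` (half of Adam's state, which
    keeps both moments per parameter).
    """
    for w, g, s in zip(params, grads, states):
        g_norm_sq = float(np.sum(g * g))          # squared L2 norm of the layer gradient
        if s.get("v") is None:                    # first step: initialize moments
            s["v"] = g_norm_sq
            s["m"] = g / (np.sqrt(g_norm_sq) + eps) + weight_decay * w
        else:
            # layer-wise second moment of the gradient norm
            s["v"] = beta2 * s["v"] + (1.0 - beta2) * g_norm_sq
            # gradient normalized by the layer-wise second moment,
            # with decoupled weight decay added before the momentum update
            s["m"] = beta1 * s["m"] + (g / (np.sqrt(s["v"]) + eps) + weight_decay * w)
        w -= lr * s["m"]                          # in-place parameter update
    return params, states

# Toy usage: two "layers" with random gradients (assumed setup for illustration).
params = [np.random.randn(4, 4), np.random.randn(4)]
states = [{"v": None, "m": None} for _ in params]
for _ in range(3):
    grads = [np.random.randn(*p.shape) for p in params]
    params, states = novograd_step(params, grads, states)
```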