Memory-Efficient Adaptive Optimization for Large-Scale Learning.

2019 
Adaptive gradient-based optimizers such as AdaGrad and Adam are among the methods of choice in modern machine learning. These methods maintain second-order statistics of each parameter, thus doubling the memory footprint of the optimizer. In behemoth-size applications, this memory overhead restricts the size of the model being used as well as the number of examples in a mini-batch. We describe a novel, simple, and flexible adaptive optimization method with sublinear memory cost that retains the benefits of per-parameter adaptivity while allowing for larger models and mini-batches. We give convergence guarantees for our method and demonstrate its effectiveness in training very large deep models.
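The memory argument in the abstract can be made concrete with a small sketch. Below is an illustrative NumPy comparison (an assumption for exposition, not the paper's exact algorithm) between a standard AdaGrad-style update, which stores one second-moment accumulator per parameter and therefore doubles optimizer memory, and a row/column-accumulator variant for a weight matrix that keeps only O(d1 + d2) auxiliary state instead of O(d1 * d2), in the spirit of the sublinear-memory approach the abstract describes. The function names and the specific row/column covering scheme are hypothetical.

```python
import numpy as np

def adagrad_step(w, g, acc, lr=0.1, eps=1e-8):
    """Standard AdaGrad-style update: `acc` has the same shape as `w`,
    so the optimizer state alone doubles the memory footprint."""
    acc += g ** 2
    w -= lr * g / (np.sqrt(acc) + eps)
    return w, acc

def row_col_adaptive_step(w, g, row_acc, col_acc, lr=0.1, eps=1e-8):
    """Illustrative sublinear-memory variant (a sketch, not the paper's
    exact method): keep one monotone accumulator per row and one per
    column of a (d1 x d2) weight matrix. Each entry's second-moment
    estimate is the minimum of its row and column accumulators, which
    upper-bounds the per-entry statistic while storing only d1 + d2 values."""
    row_acc = np.maximum(row_acc, np.max(g ** 2, axis=1))    # per-row update
    col_acc = np.maximum(col_acc, np.max(g ** 2, axis=0))    # per-column update
    approx = np.minimum(row_acc[:, None], col_acc[None, :])  # entrywise bound
    w -= lr * g / (np.sqrt(approx) + eps)
    return w, row_acc, col_acc

# Usage: for a 1000 x 1000 weight matrix, the full accumulator needs
# 10^6 floats of optimizer state, the row/column scheme only 2,000.
d1, d2 = 1000, 1000
w = np.random.randn(d1, d2) * 0.01
g = np.random.randn(d1, d2) * 0.01
acc = np.zeros((d1, d2))
row_acc, col_acc = np.zeros(d1), np.zeros(d2)
w, acc = adagrad_step(w.copy(), g, acc)
w, row_acc, col_acc = row_col_adaptive_step(w, g, row_acc, col_acc)
```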