A Stagewise Hyperparameter Scheduler to Improve Generalization

2021 
Stochastic gradient descent (SGD) augmented with momentum variants (e.g., heavy ball momentum (SHB) and Nesterov's accelerated gradient (NAG)) has been the default optimizer for many learning tasks. Tuning the optimizer's hyperparameters is arguably the most time-consuming part of model training, and many new momentum variants, despite their empirical advantages over classical SHB/NAG, introduce even more hyperparameters to tune. Automating this tedious and error-prone tuning is essential for AutoML. This paper focuses on efficiently tuning a large class of multistage momentum variants to improve generalization. We adopt the general formulation of quasi-hyperbolic momentum (QHM) and extend "constant and drop", the widely used learning-rate schedule in which α starts large and is dropped every few epochs, to other hyperparameters (e.g., batch size b, momentum parameter β, and instant discount factor ν). Multistage QHM is a unified framework that covers a large family of momentum variants as special cases (e.g., vanilla SGD, SHB, and NAG). Existing work mainly focuses on scheduling the decay of α, whereas multistage QHM additionally allows b, β, and ν to vary across stages and achieves better generalization than tuning α alone. Our tuning strategies rest on rigorous justifications rather than blind trial and error: we theoretically prove why they can improve generalization, and we also establish the convergence of multistage QHM for general nonconvex objective functions. Empirically, our strategies simplify the tuning process and outperform competitive optimizers in test accuracy.
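
To make the scheduling idea concrete, below is a minimal NumPy sketch of how a stagewise ("constant and drop") schedule over α, β, and ν can wrap the QHM update. The class name MultistageQHM, the stage-dictionary format, and the particular stage values are illustrative assumptions, not the paper's implementation; batch-size changes across stages, which would be handled by the data loader, are omitted.

import numpy as np


class MultistageQHM:
    """Stagewise QHM: (alpha, beta, nu) are held constant within a stage
    and changed ("dropped") at stage boundaries."""

    def __init__(self, params, stages):
        # stages: list of dicts with keys "steps", "alpha", "beta", "nu".
        self.params = params
        self.stages = stages
        self.buf = np.zeros_like(params)   # momentum buffer g
        self.step_count = 0

    def _current_stage(self):
        t = self.step_count
        for stage in self.stages:
            if t < stage["steps"]:
                return stage
            t -= stage["steps"]
        return self.stages[-1]             # stay in the last stage afterwards

    def step(self, grad):
        s = self._current_stage()
        alpha, beta, nu = s["alpha"], s["beta"], s["nu"]
        # QHM update:
        #   g_t     = beta * g_{t-1} + (1 - beta) * grad_t
        #   theta_t = theta_{t-1} - alpha * ((1 - nu) * grad_t + nu * g_t)
        self.buf = beta * self.buf + (1.0 - beta) * grad
        self.params -= alpha * ((1.0 - nu) * grad + nu * self.buf)
        self.step_count += 1
        return self.params


# Illustrative two-stage schedule on a toy quadratic f(theta) = ||theta||^2:
# a large alpha first, then drop alpha while shifting beta and nu.
theta = np.array([5.0, -3.0])
opt = MultistageQHM(theta, stages=[
    {"steps": 300, "alpha": 0.10, "beta": 0.90, "nu": 0.7},
    {"steps": 300, "alpha": 0.01, "beta": 0.95, "nu": 1.0},
])
for _ in range(600):
    opt.step(2.0 * opt.params)             # gradient of ||theta||^2
print(opt.params)                          # close to the minimizer at 0

The same update covers the classical special cases the abstract mentions (ν = 0 gives SGD, ν = 1 gives normalized SHB, ν = β gives a NAG-style update), which is what allows one stagewise scheduler to handle the whole family.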