Backtracking gradient descent method for general $C^1$ functions

2018 
The backtracking gradient descent method is a popular method for finding critical points (in particular, minima) of a $C^1$ function $f$, and is defined as follows. Given $\delta _0>0$ and $0<\beta <1$, we define for $x\in \mathbb{R}^m$ the number $\delta (f,\delta _0,x)$ to be the largest number $\delta$ in the set $\{\beta ^n\delta _0:~n=0,1,2,\ldots \}$ for which $f(x - \delta \nabla f(x))-f(x)\leq -\delta ||\nabla f(x)||^2/2$. For any $z_0\in \mathbb{R}^m$, we define $z_n=z_{n-1}-\delta (f,\delta _0,z_{n-1})\nabla f(z_{n-1})$ for $n=1,2,\ldots $. Our main results are the following.

{\bf Theorem.}
(1) Any cluster point of $\{z_n\}$ is a critical point of $f$.
(2) Either $\lim _{n\rightarrow \infty }||z_n||=\infty$ or $\lim _{n\rightarrow \infty }||z_{n+1}-z_n||=0$.
(3) Assume that $f$ has at most countably many critical points. Then either $\lim _{n\rightarrow \infty }||z_n||=\infty$ or $\{z_n\}$ converges to a critical point of $f$.
(4) More generally, assume that all connected components of the set of critical points of $f$ are compact. Then either $\lim _{n\rightarrow \infty }||z_n||=\infty$ or $\{z_n\}$ is bounded; in the latter case, the set of cluster points of $\{z_n\}$ is connected.

Concerning which types of critical points may appear as limit points, we prove in another theorem that it is very rare for a cluster point of $\{z_n\}$ to be a saddle point. We then introduce a new method that mixes the standard gradient descent method with (a slight modification of) the backtracking gradient descent method, aiming to retain the good features of both while reducing their drawbacks. Experiments with the MNIST and CIFAR10 data sets show that the new method is on par with current state-of-the-art methods such as MMT, NAG, Adagrad, Adadelta, RMSProp and Adam, in particular in the long run.
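To make the iteration above concrete, here is a minimal sketch in Python with NumPy. The names backtracking_delta, backtracking_gd, max_steps, and tol are illustrative choices of ours, not identifiers from the paper, and the cap on the number of shrinkings and the gradient-norm stopping rule are assumptions added so that a run terminates.

```python
import numpy as np

def backtracking_delta(f, grad_f, x, delta0=1.0, beta=0.5, max_steps=60):
    """Largest delta in {beta^n * delta0 : n = 0, 1, 2, ...} satisfying
    f(x - delta * grad) - f(x) <= -(delta / 2) * ||grad||^2 (Armijo's condition)."""
    g = grad_f(x)
    gnorm2 = np.dot(g, g)
    delta = delta0
    for _ in range(max_steps):
        if f(x - delta * g) - f(x) <= -0.5 * delta * gnorm2:
            return delta
        delta *= beta  # reject; try the next candidate beta^{n+1} * delta0
    return delta  # cap at max_steps shrinkings (assumption, for a finite loop)

def backtracking_gd(f, grad_f, z0, n_iters=10_000, delta0=1.0, beta=0.5, tol=1e-12):
    """Iterate z_n = z_{n-1} - delta(f, delta0, z_{n-1}) * grad f(z_{n-1})."""
    z = np.asarray(z0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(z)
        if np.dot(g, g) < tol:  # near a critical point; stop (added rule)
            break
        z = z - backtracking_delta(f, grad_f, z, delta0, beta) * g
    return z

# Example run: f(x, y) = x^2 + 10*y^2 has a single critical point, the origin,
# so by part (3) of the theorem the sequence must converge to it.
f = lambda z: z[0] ** 2 + 10.0 * z[1] ** 2
grad_f = lambda z: np.array([2.0 * z[0], 20.0 * z[1]])
print(backtracking_gd(f, grad_f, [3.0, -1.5]))  # approximately [0. 0.]
```

The choice $\beta =1/2$ simply halves the candidate learning rate after each rejection; any $0<\beta <1$ fits the definition above.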