Improving kernel online learning with a snapshot memory

Information is mounting exponentially, and the world is moving to hunt knowledge with the help of Big Data. The labelled data is used for automated learning and data analysis which is termed as Machine Learning. Linear Regression is a statistical method for predictive analysis. Gradient Descent is the process which uses cost function on gradients for minimizing the complexity in computing mean square error. This work presents an insight into the different types of Gradient descent algorithms namely, Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent, which are implemented on a Linear regression dataset, and hence determine the computational complexity and other factors like learning rate, batch size and number of iterations which affect the efficiency of the algorithm.

Stochastic Gradient Descent

10.36227/techrxiv.14544000.v1

Cite

Citations (2)

Accelerating Extreme Search Based on Natural Gradient Descent with Beta Distribution

Pavel Lyakhov Ruslan Abdulkadirov

Natural gradient descent is an optimization method developed from the information geometry. It works well for many applications due to the better convergence and can be a good alternative for gradient descent and stochastic gradient descent in machine learning and statistics. The goal of this work is to propose a natural gradient descent algorithm with the beta distribution and the stepsize adaptation. We compare the minimizing process of gradient descent with natural gradient descent with respect to Gauss and Beta distributions. Additionally, the calculating of the Fisher matrix for computing natural gradient will be represented.

Stochastic Gradient Descent

10.1109/ent50460.2021.9681769

Cite

Citations (4)

Training a Two Layer ReLU Network Analytically

arXiv (Cornell University) (2023)

Adrian Barbu

Neural networks are usually trained with different variants of gradient descent based optimization algorithms such as stochastic gradient descent or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. However, in this work we will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternatively finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than Stochastic Gradient Descent or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.

Maxima and minima

Stochastic Gradient Descent

Activation function

Square (algebra)

10.48550/arxiv.2304.02972

Cite

Citations (0)

Stochastic Gradient Descent

Machine learning (2015)

Sergios Theodoridis

Stochastic Gradient Descent

Descent (aeronautics)

Stochastic Approximation

Descent direction

10.1016/b978-0-12-801522-3.00005-7

Cite

Citations (36)

Training a Two-Layer ReLU Network Analytically

Sensors (2023)

Adrian Barbu

Neural networks are usually trained with different variants of gradient descent-based optimization algorithms such as the stochastic gradient descent or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. However, in this work, we will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternatively finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than stochastic gradient descent or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.

Maxima and minima

Stochastic Gradient Descent

Activation function

Square (algebra)

10.3390/s23084072

Cite

Citations (4)

Variance Reduction for Distributed Stochastic Gradient Descent

arXiv (Cornell University) (2015)

Soham De Gavin Taylor Tom Goldstein

Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms perform favorably to other stochastic methods.

Variance reduction

Stochastic Gradient Descent

10.48550/arxiv.1512.01708

Cite

Citations (8)

Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training

Zhuang Wang Koby Crammer Slobodan Vučetić

Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction.

Kernelization

Stochastic Gradient Descent

Kernel (algebra)

Cite

Citations (184)

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

neural information processing systems (2013)

Rie Johnson Tong Zhang

Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.

Variance reduction

Stochastic Gradient Descent

Source

Cite

Citations (1,981)

A mini-batch stochastic conjugate gradient algorithm with variance reduction

Journal of Global Optimization (2022)

Caixia Kou Han Yang

Stochastic Gradient Descent

Variance reduction

10.1007/s10898-022-01205-4

Cite