logo
    Improving kernel online learning with a snapshot memory
    2
    Citation
    10
    Reference
    10
    Related Paper
    Citation Trend
    Keywords:
    Stochastic Gradient Descent
    Variance reduction
    Snapshot (computer storage)
    Kernelization
    Benchmark (surveying)
    Information is mounting exponentially, and the world is moving to hunt knowledge with the help of Big Data. The labelled data is used for automated learning and data analysis which is termed as Machine Learning. Linear Regression is a statistical method for predictive analysis. Gradient Descent is the process which uses cost function on gradients for minimizing the complexity in computing mean square error. This work presents an insight into the different types of Gradient descent algorithms namely, Batch Gradient Descent, Stochastic Gradient Descent and Mini-Batch Gradient Descent, which are implemented on a Linear regression dataset, and hence determine the computational complexity and other factors like learning rate, batch size and number of iterations which affect the efficiency of the algorithm.
    Stochastic Gradient Descent
    Natural gradient descent is an optimization method developed from the information geometry. It works well for many applications due to the better convergence and can be a good alternative for gradient descent and stochastic gradient descent in machine learning and statistics. The goal of this work is to propose a natural gradient descent algorithm with the beta distribution and the stepsize adaptation. We compare the minimizing process of gradient descent with natural gradient descent with respect to Gauss and Beta distributions. Additionally, the calculating of the Fisher matrix for computing natural gradient will be represented.
    Stochastic Gradient Descent
    Neural networks are usually trained with different variants of gradient descent based optimization algorithms such as stochastic gradient descent or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. However, in this work we will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternatively finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than Stochastic Gradient Descent or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.
    Maxima and minima
    Stochastic Gradient Descent
    Activation function
    Square (algebra)
    Citations (0)
    Stochastic Gradient Descent
    Descent (aeronautics)
    Stochastic Approximation
    Descent direction
    Neural networks are usually trained with different variants of gradient descent-based optimization algorithms such as the stochastic gradient descent or the Adam optimizer. Recent theoretical work states that the critical points (where the gradient of the loss is zero) of two-layer ReLU networks with the square loss are not all local minima. However, in this work, we will explore an algorithm for training two-layer neural networks with ReLU-like activation and the square loss that alternatively finds the critical points of the loss function analytically for one layer while keeping the other layer and the neuron activation pattern fixed. Experiments indicate that this simple algorithm can find deeper optima than stochastic gradient descent or the Adam optimizer, obtaining significantly smaller training loss values on four out of the five real datasets evaluated. Moreover, the method is faster than the gradient descent methods and has virtually no tuning parameters.
    Maxima and minima
    Stochastic Gradient Descent
    Activation function
    Square (algebra)
    Citations (4)
    Variance reduction (VR) methods boost the performance of stochastic gradient descent (SGD) by enabling the use of larger, constant stepsizes and preserving linear convergence rates. However, current variance reduced SGD methods require either high memory usage or an exact gradient computation (using the entire dataset) at the end of each epoch. This limits the use of VR methods in practical distributed settings. In this paper, we propose a variance reduction method, called VR-lite, that does not require full gradient computations or extra storage. We explore distributed synchronous and asynchronous variants that are scalable and remain stable with low communication frequency. We empirically compare both the sequential and distributed algorithms to state-of-the-art stochastic optimization methods, and find that our proposed algorithms perform favorably to other stochastic methods.
    Variance reduction
    Stochastic Gradient Descent
    Citations (8)
    Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and effectiveness. When equipped with kernel functions, similarly to other SVM learning algorithms, SGD is susceptible to the curse of kernelization that causes unbounded linear growth in model size and update time with data size. This may render SGD inapplicable to large data sets. We address this issue by presenting a class of Budgeted SGD (BSGD) algorithms for large-scale kernel SVM training which have constant space and constant time complexity per update. Specifically, BSGD keeps the number of support vectors bounded during training through several budget maintenance strategies. We treat the budget maintenance as a source of the gradient error, and show that the gap between the BSGD and the optimal SVM solutions depends on the model degradation due to budget maintenance. To minimize the gap, we study greedy budget maintenance methods based on removal, projection, and merging of support vectors. We propose budgeted versions of several popular online SVM algorithms that belong to the SGD family. We further derive BSGD algorithms for multi-class SVM training. Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction.
    Kernelization
    Stochastic Gradient Descent
    Kernel (algebra)
    Citations (184)
    Stochastic gradient descent is popular for large scale optimization but has slow convergence asymptotically due to the inherent variance. To remedy this problem, we introduce an explicit variance reduction method for stochastic gradient descent which we call stochastic variance reduced gradient (SVRG). For smooth and strongly convex functions, we prove that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG). However, our analysis is significantly simpler and more intuitive. Moreover, unlike SDCA or SAG, our method does not require the storage of gradients, and thus is more easily applicable to complex problems such as some structured prediction problems and neural network learning.
    Variance reduction
    Stochastic Gradient Descent
    Citations (1,981)