By: Jeffrey Kondas, Technology Fellow
Abstract
Stochastic Gradient Descent (SGD) has become a cornerstone in the field of machine learning due to its efficiency in dealing with large datasets. This white paper explores SGD, its limitations, and the evolution of its variants, providing an in-depth look at how these algorithms enhance the training of neural networks. We will cover theoretical foundations, practical implementation, and discuss the implications of these methods on the efficiency and convergence of machine learning models.
Introduction
Machine learning models, especially neural networks, rely on optimization algorithms to adjust parameters iteratively and minimize error. SGD stands out for its ability to handle large datasets by processing single examples or small batches at a time, yielding frequent parameter updates at a fraction of the per-step cost of full-batch gradient descent.
Stochastic Gradient Descent (SGD)
Fundamental Concept
SGD updates parameters based on the gradient of the loss function for one or a small subset of training examples:
w_{new} = w_{old} - \eta \cdot \nabla J(w; \mathbf{x}^{(i)}, y^{(i)})
Where:
- \eta is the learning rate
- \nabla J(w; \mathbf{x}^{(i)}, y^{(i)}) is the gradient of the loss function J for a single example or mini-batch.
Code Sample (Python):
```python
import numpy as np

def sgd_update(w, learning_rate, gradient):
    # One SGD step: move the parameters against the gradient of the loss
    # computed on a single example (or mini-batch).
    return w - learning_rate * gradient
```
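For context, the sketch below threads `sgd_update` through a per-example training loop on least-squares linear regression; the synthetic data, model, and gradient derivation are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: per-example SGD on least-squares linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(3)
for epoch in range(5):
    for i in rng.permutation(len(X)):
        # Gradient of 0.5 * (x_i . w - y_i)^2 with respect to w.
        grad = (X[i] @ w - y[i]) * X[i]
        w = sgd_update(w, learning_rate=0.01, gradient=grad)
```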
Challenges
- Learning Rate Sensitivity: An inappropriate learning rate can lead to slow convergence or divergence (see the sketch after this list).
- Noisy Updates: The stochastic nature can cause high variance in parameter updates.
- Local Minima and Saddle Points: SGD might struggle with complex loss landscapes.
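To make the first point concrete, the following sketch runs `sgd_update` on the one-dimensional quadratic J(w) = w^2, whose gradient is 2w; the two learning rates are arbitrary illustrative choices.

```python
# Illustrative sketch of learning-rate sensitivity on J(w) = w^2 (gradient 2w).
for lr in (0.1, 1.1):
    w = 5.0
    for _ in range(20):
        w = sgd_update(w, learning_rate=lr, gradient=2 * w)
    print(f"lr={lr}: w after 20 steps = {w:.3e}")
# lr=0.1 contracts toward the minimum at 0; lr=1.1 overshoots and diverges,
# since each step multiplies w by (1 - 2*lr).
```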
Source:
- Ruder, S. (2016). “An overview of gradient descent optimization algorithms.” arXiv preprint arXiv:1609.04747.
Variants of SGD
Mini-Batch SGD
Mini-Batch SGD computes the gradient on a small batch of examples, trading the low per-step cost of single-example SGD against the lower-variance updates of full-batch gradient descent:
w_{new} = w_{old} - \eta \cdot \frac{1}{m} \sum_{i=1}^{m} \nabla J(w; \mathbf{x}^{(i)}, y^{(i)})
Where m is the size of the mini-batch.
Code Sample (Python):
```python
def mini_batch_sgd_update(w, learning_rate, gradients):
    # `gradients` holds one gradient per example in the mini-batch;
    # averaging them along axis 0 gives the mini-batch gradient for the step.
    return w - learning_rate * np.mean(gradients, axis=0)
```
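As a usage sketch, with hypothetical data `X`, `y` and a least-squares gradient assumed purely for illustration, one epoch of mini-batch SGD might look like this:

```python
# Sketch of one epoch of mini-batch SGD; X, y, and the squared-error
# gradient are illustrative assumptions.
def run_epoch(w, X, y, learning_rate=0.01, batch_size=32):
    indices = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = indices[start:start + batch_size]
        # One gradient row per example in the batch.
        residuals = X[batch] @ w - y[batch]
        gradients = residuals[:, None] * X[batch]
        w = mini_batch_sgd_update(w, learning_rate, gradients)
    return w
```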
Momentum SGD
Momentum adds a fraction of the previous update to the current one, damping oscillations and accelerating progress along directions where gradients point consistently:
v_t = \gamma v_{t-1} + \eta \nabla J(w)
w_{new} = w_{old} - v_t
Where \gamma is the momentum coefficient.
Code Sample (Python):
```python
def momentum_sgd_update(w, learning_rate, gradient, velocity, momentum):
    # Accumulate a velocity from past gradients (gamma = momentum), then step
    # against it, matching v_t = gamma * v_{t-1} + eta * grad and w_new = w - v_t.
    velocity = momentum * velocity + learning_rate * gradient
    return w - velocity, velocity
```
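A brief usage sketch: the velocity is state that the caller initializes to zeros and threads through successive calls; the quadratic loss and the momentum value 0.9 (a common default) are assumptions for illustration.

```python
# Sketch: threading the velocity state through momentum updates on the
# quadratic J(w) = 0.5 * w . w, whose gradient is simply w.
w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
for _ in range(50):
    w, velocity = momentum_sgd_update(w, learning_rate=0.1, gradient=w,
                                      velocity=velocity, momentum=0.9)
```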
Source:
- Sutskever, I., et al. (2013). “On the importance of initialization and momentum in deep learning.” ICML.
Nesterov Accelerated Gradient (NAG)
NAG looks ahead by calculating the gradient after a partial momentum update:
v_t = \gamma v_{t-1} + \eta \nabla J(w - \gamma v_{t-1})
w_{new} = w_{old} - v_t
Code Sample (Python):
```python
def nag_sgd_update(w, learning_rate, grad_fn, velocity, momentum):
    # grad_fn(w) must return the gradient of the loss evaluated at w.
    # Evaluate the gradient at the look-ahead point w - gamma * v_{t-1}.
    look_ahead_w = w - momentum * velocity
    velocity = momentum * velocity + learning_rate * grad_fn(look_ahead_w)
    return w - velocity, velocity
```
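Because the look-ahead step needs the gradient as a function of the parameters, the routine above takes a callable `grad_fn` rather than a precomputed gradient. A minimal usage sketch on the same illustrative quadratic:

```python
# Sketch: NAG on J(w) = 0.5 * w . w, whose gradient is simply w.
def grad_fn(w):
    return w

w = np.array([5.0, -3.0])
velocity = np.zeros_like(w)
for _ in range(50):
    w, velocity = nag_sgd_update(w, learning_rate=0.1, grad_fn=grad_fn,
                                 velocity=velocity, momentum=0.9)
```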
Source:
- Nesterov, Y. (1983). “A method for solving the convex programming problem with convergence rate O(1/k^2).” Soviet Mathematics Doklady.
AdaGrad
AdaGrad adapts the learning rate for each parameter individually, shrinking it for parameters whose accumulated squared gradients are large:
G_t = G_{t-1} + (\nabla J(w))^2
w_{new} = w_{old} - \frac{\eta}{\sqrt{G_t + \epsilon}} \cdot \nabla J(w)
Where \epsilon is a small constant to avoid division by zero.
Code Sample (Python):
```python
def adagrad_update(w, learning_rate, gradient, g, epsilon=1e-8):
    # Accumulate squared gradients per parameter; avoid in-place += so the
    # caller's accumulator is not mutated as a side effect.
    g = g + gradient ** 2
    return w - (learning_rate / np.sqrt(g + epsilon)) * gradient, g
```
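In use, the accumulator starts at zero and grows monotonically, so AdaGrad's effective step size only shrinks over time; the sketch below (illustrative quadratic and hyperparameters) makes the state-threading explicit.

```python
# Sketch: AdaGrad state threading on J(w) = 0.5 * w . w (gradient w).
w = np.array([5.0, -3.0])
g = np.zeros_like(w)
for _ in range(100):
    w, g = adagrad_update(w, learning_rate=0.5, gradient=w, g=g)
```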
Source:
- Duchi, J., et al. (2011). “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research.
RMSprop
RMSprop replaces AdaGrad's ever-growing sum with an exponentially decaying average of squared gradients, so the effective learning rate does not shrink toward zero over long runs:
E[g^2]_t = \rho E[g^2]_{t-1} + (1 - \rho)(\nabla J(w))^2
w_{new} = w_{old} - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} \cdot \nabla J(w)
Where \rho is the decay rate.
Code Sample (Python):
```python
def rmsprop_update(w, learning_rate, gradient, eg2, rho=0.9, epsilon=1e-8):
    # Exponentially weighted moving average of squared gradients.
    eg2 = rho * eg2 + (1 - rho) * (gradient ** 2)
    return w - (learning_rate / np.sqrt(eg2 + epsilon)) * gradient, eg2
```
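A short sketch with a constant illustrative gradient contrasts the two accumulators: AdaGrad's sum keeps growing, while RMSprop's decaying average settles, so its effective step size stabilizes instead of vanishing.

```python
# Sketch: effective per-parameter step size eta / sqrt(accumulator) after
# repeated small gradients; all values are illustrative.
g_sum, eg2 = 0.0, 0.0
for _ in range(100):
    grad = 0.1
    g_sum += grad ** 2                 # AdaGrad accumulator keeps growing
    eg2 = 0.9 * eg2 + 0.1 * grad ** 2  # RMSprop average settles near grad**2
print(0.01 / np.sqrt(g_sum + 1e-8))    # keeps shrinking as steps accumulate
print(0.01 / np.sqrt(eg2 + 1e-8))      # stabilizes around 0.01 / 0.1 = 0.1
```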
Source:
- Tieleman, T., & Hinton, G. (2012). “Lecture 6.5—rmsprop: Divide the gradient by a running average of its recent magnitude.” COURSERA: Neural Networks for Machine Learning.
Adam (Adaptive Moment Estimation)
Adam combines momentum-style first-moment estimates with RMSprop-style second-moment scaling, maintaining a per-parameter adaptive learning rate:
m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla J(w)
v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla J(w))^2
Bias correction:
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}
Update:
w_{new} = w_{old} - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
Code Sample (Python):
```python
def adam_update(w, learning_rate, gradient, m, v, t,
                beta1=0.9, beta2=0.999, epsilon=1e-8):
    # First and second moment estimates (t is the 1-based step count).
    m = beta1 * m + (1 - beta1) * gradient
    v = beta2 * v + (1 - beta2) * (gradient ** 2)
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon), m, v
```
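A usage sketch: both moment estimates start at zero and the step counter t starts at 1 so the bias correction is well defined; the quadratic loss and hyperparameters are illustrative assumptions.

```python
# Sketch: Adam on the illustrative quadratic J(w) = 0.5 * w . w (gradient w).
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 201):  # 1-based so the bias correction never divides by zero
    w, m, v = adam_update(w, learning_rate=0.1, gradient=w, m=m, v=v, t=t)
```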
Source:
- Kingma, D. P., & Ba, J. (2014). “Adam: A Method for Stochastic Optimization.” arXiv preprint arXiv:1412.6980.
Conclusion
SGD and its variants are pivotal in modern machine learning, each addressing a different aspect of the optimization problem, from handling large datasets to improving convergence speed and stability. For further exploration, the implementations of these optimizers in deep learning libraries such as PyTorch and TensorFlow offer a hands-on way to study their behavior in practice.
This white paper has aimed to provide a comprehensive, self-contained overview of SGD and its principal variants.