# stochastic gradient descent

(1.5 hours to learn)

## Summary

Stochastic gradient descent (SGD) is an iterative optimization algorithm for objective functions that decompose as a sum (or linear combination) of differentiable functions. Such functions arise naturally when the full objective is a sum of per-data-point losses, e.g. a least squares objective. While batch gradient descent computes the full gradient at every step, SGD approximates it with the gradient of a single term in the sum (or a small "mini-batch" of terms), e.g. the gradient of the loss at one data point. SGD is often used to optimize non-convex functions, such as those that arise in training neural networks.
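The idea above can be sketched in a few lines. This is a minimal illustration, not a production optimizer: it assumes a noiseless least-squares objective, a fixed learning rate, and made-up data, and the function names (`batch_gradient`, `sgd`) are our own.

```python
import numpy as np

# Least-squares objective f(w) = (1/2n) * sum_i (x_i . w - y_i)^2,
# i.e. a sum of per-example losses -- exactly the setting SGD assumes.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true  # noiseless targets, for illustration only

def batch_gradient(w):
    # Batch gradient descent would use this: the average of all
    # per-example gradients, computed over the entire dataset.
    return X.T @ (X @ w - y) / n

def sgd(w0, lr=0.05, epochs=50):
    w = w0.copy()
    for _ in range(epochs):
        for i in rng.permutation(n):
            # SGD replaces the full gradient with the gradient of the
            # loss at a single data point i.
            grad_i = (X[i] @ w - y[i]) * X[i]
            w -= lr * grad_i
    return w

w_sgd = sgd(np.zeros(d))
print(np.allclose(w_sgd, w_true, atol=1e-2))  # SGD recovers w_true approximately
```

Each SGD step is roughly `n` times cheaper than a batch step, which is the main reason SGD dominates on large datasets.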

## Context

This concept has the prerequisites:

## Goals

- Understand the difference between stochastic gradient descent and batch gradient descent.
- Understand when stochastic gradient descent provides a reasonable approximation to the full gradient.
- Understand how the learning rate in stochastic gradient descent is interpreted compared with full (batch) gradient descent.
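On the last goal: because SGD's per-example gradients are noisy, the learning rate is typically decayed over time (e.g. satisfying the Robbins–Monro conditions) so the noise averages out, whereas batch gradient descent can use a fixed rate. A tiny sketch, using the standard observation that SGD on a squared loss with the schedule `lr_t = 1/t` reproduces the running sample mean (the data values here are arbitrary):

```python
# SGD on f(w) = (1/2n) * sum_i (w - x_i)^2 with the decaying
# schedule lr_t = 1/t. The per-example gradient of (w - x)^2 / 2
# is (w - x), and this schedule makes w track the running mean.
data = [2.0, 4.0, 6.0, 8.0]
w = 0.0
for t, x in enumerate(data, start=1):
    w -= (1.0 / t) * (w - x)
print(w)  # 5.0 -- the sample mean of data
```

With a fixed learning rate, w would instead keep bouncing around the mean with variance proportional to the rate, which is why decay schedules matter for SGD in a way they do not for batch gradient descent.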

## Core resources (read/watch one of the following)

## -Free-

→ Coursera: Machine Learning (2013)

An online machine learning course aimed at a broad audience.

Additional dependencies:

- linear regression

Other notes:

- Click on "Preview" to see the videos.

## Supplemental resources (the following are optional, but you may find them useful)

## -Free-

→ Wikipedia

→ Coursera: Machine Learning

An online machine learning course aimed at advanced undergraduates.

Location:
Stochastic gradient descent

Other notes:

- Click on "Preview" to see the videos.

## See also

- Natural gradient allows us to speed up stochastic gradient descent by accounting for the curvature of the objective function.
- Second-order optimization methods more generally often converge faster than first-order methods, though they are harder to apply in a stochastic setting.
- In large-scale machine learning, stochastic gradient descent achieves a good tradeoff between statistical error and optimization error.