weight decay in neural networks

(50 minutes to learn)

Summary

When training neural networks, it is common to use "weight decay," where after each update, the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large, and is approximately equivalent to gradient descent on the original cost function plus a quadratic (L2) penalty on the weights.
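
To make the equivalence concrete, here is a minimal NumPy sketch (not from the course; the variable names and the learning-rate/decay values are illustrative) comparing the two forms of the update. The "shrink after the step" rule and the "gradient of loss plus L2 penalty" rule differ only by a term of order (learning rate)^2 times the decay coefficient, which is negligible for small learning rates.

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=5)          # current weights
    loss_grad = rng.normal(size=5)  # gradient of the unregularized loss at w
    lr = 0.1                        # learning rate (illustrative value)
    decay = 0.01                    # weight-decay coefficient, i.e. lambda (illustrative value)

    # Form 1: take a gradient step, then multiply the weights by a factor
    # slightly less than 1.
    w_decay = (w - lr * loss_grad) * (1 - lr * decay)

    # Form 2: gradient descent on  loss + (decay / 2) * ||w||^2,
    # whose gradient adds a  decay * w  term to the loss gradient.
    w_l2 = w - lr * (loss_grad + decay * w)

    # The two updates agree up to a term  lr**2 * decay * loss_grad,
    # so the printed difference is tiny.
    print(np.max(np.abs(w_decay - w_l2)))
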

Context

This concept has the prerequisites:

Core resources (read/watch one of the following)

-Free-

Coursera: Neural Networks for Machine Learning (2012)
An online course by Geoff Hinton, who pioneered many of the core ideas behind neural nets and deep learning.
  • Lecture "Overview of ways to improve generalization"
  • Lecture "Limiting the size of the weights"
Author: Geoffrey E. Hinton

-Paid-

See also

  • Weight decay is an example of a regularization method. (go to concept)
  • The $L_2$ norm of the weights isn't necessarily a good regularizer for neural nets. Some more principled alternatives include:
    • Tikhonov regularization, which rewards invariance to noise in the inputs (go to concept)
    • Tangent propagation, which rewards invariance to irrelevant transformations of the inputs such as translation and scaling (go to concept)
  • Early stopping is another strategy to prevent overfitting in neural nets. (go to concept)