weight decay in neural networks
(50 minutes to learn)
When training neural networks, it is common to use "weight decay," where after each update, the weights are multiplied by a factor slightly less than 1. This prevents the weights from growing too large, and is equivalent to performing gradient descent on the cost function with a quadratic (L2) regularization term added.
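The equivalence can be checked numerically: subtracting the gradient of a penalty (lam/2)·||w||² gives exactly the same update as shrinking the weights by a factor (1 − lr·lam) and then taking a plain gradient step. A minimal sketch (all variable names here are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)        # current weights
grad = rng.normal(size=5)     # gradient of the data loss at w
lr, lam = 0.1, 0.01           # learning rate and decay strength

# Explicit L2 penalty: the gradient of (lam/2) * ||w||^2 is lam * w
w_penalty = w - lr * (grad + lam * w)

# Weight decay: shrink the weights by (1 - lr*lam), then take a plain step
w_decay = (1 - lr * lam) * w - lr * grad

print(np.allclose(w_penalty, w_decay))  # the two updates coincide
```

If the decay factor is instead applied *after* the gradient step, as described above, the two updates differ only by a term of order lr², which is negligible for the small learning rates used in practice.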
This concept has the prerequisites:
- backpropagation (Weight decay is applied to the weight updates computed by backpropagation.)
- ridge regression (Weight decay is a way of implementing an L2 regularization term.)
Core resources (read/watch one of the following)
→ Coursera: Neural Networks for Machine Learning (2012)
An online course by Geoff Hinton, who pioneered many of the core ideas behind neural nets and deep learning.
- Lecture "Overview of ways to improve generalization"
- Lecture "Limiting the size of the weights"
→ Pattern Recognition and Machine Learning
A textbook for a graduate machine learning course, with a focus on Bayesian methods.
Location: Sections 5.5-5.5.1, pages 256-259
- Weight decay is an example of a regularization method. (go to concept)
- The $L_2$ norm of the weights isn't necessarily a good regularizer for neural nets. Some more principled alternatives include:
- Tikhonov regularization, which rewards invariance to noise in the inputs (go to concept)
- Tangent propagation, which rewards invariance to irrelevant transformations of the inputs such as translation and scaling (go to concept)
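To make the Tikhonov idea concrete, the penalty can be written as the squared sensitivity of the network's outputs to its inputs, i.e. the squared Frobenius norm of the input-output Jacobian. A hedged sketch, using a tiny hypothetical one-hidden-layer net and a finite-difference estimate of the Jacobian (in practice one would use automatic differentiation):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(4, 3))  # hidden-layer weights (illustrative)
W2 = rng.normal(scale=0.5, size=(1, 4))  # output-layer weights (illustrative)

def net(x):
    # A small one-hidden-layer network with tanh units
    return W2 @ np.tanh(W1 @ x)

def tikhonov_penalty(x, eps=1e-4):
    # Approximate ||dy/dx||_F^2 by central differences: a network that is
    # insensitive to small input noise gets a small penalty.
    total = 0.0
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        total += np.sum(((net(x + d) - net(x - d)) / (2 * eps)) ** 2)
    return total

x = rng.normal(size=3)
print(tikhonov_penalty(x))  # add lam * this to the training loss
```

Note how this differs from weight decay: the penalty depends on the network's behavior at the training inputs, not just on the magnitude of the weights.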