ridge regression
(1.3 hours to learn)
Summary
Ordinary linear regression can overfit by forcing the learned parameters to match all the idiosyncrasies of the training data. Ridge regression, or L2-regularized linear regression, extends the cost function with a regularizer that penalizes large weights, specifically the sum of their squares. This leads to simpler solutions and often improves generalization performance. The same idea of regularization can be used to improve the generalization of many other statistical models as well.
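As a minimal sketch of the penalized cost function described above, the following NumPy code fits ridge regression in closed form. It assumes the common objective ||y - Xw||^2 + lam * ||w||^2 and omits the intercept term (which is usually left unpenalized). The helper name ridge_fit, the value of lam, and the synthetic data are illustrative assumptions, not taken from the resources listed on this page.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize ||y - X w||^2 + lam * ||w||^2 (hypothetical helper, no intercept)."""
    n_features = X.shape[1]
    # Setting the gradient to zero gives (X^T X + lam * I) w = X^T y.
    # With lam = 0 this reduces to ordinary least squares.
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
true_w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_w + 0.1 * rng.normal(size=20)

w_ols = ridge_fit(X, y, lam=0.0)     # unregularized fit
w_ridge = ridge_fit(X, y, lam=10.0)  # penalized fit

# The penalty shrinks the weight vector toward zero.
print("OLS weight norm:  ", np.linalg.norm(w_ols))
print("Ridge weight norm:", np.linalg.norm(w_ridge))
```

Increasing lam shrinks the weights further; in practice its value is usually chosen by cross-validation rather than fixed in advance.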
Context
This concept has the prerequisites:
- linear regression (Ridge regression is a variant of linear regression.)
- generalization (Ridge regression is a way of improving generalization.)
Core resources (read/watch one of the following)
-Free-
→ The Elements of Statistical Learning
A graduate-level statistical learning textbook with a focus on frequentist methods.
Location:
Section 3.4.3, subsection "Ridge regression," pages 59-64
Supplemental resources (the following are optional, but you may find them useful)
-Free-
→ Coursera: Machine Learning (2013)
An online machine learning course aimed at a broad audience.
Other notes:
- Click on "Preview" to see the videos.
-Paid-
→ Pattern Recognition and Machine Learning
A textbook for a graduate machine learning course, with a focus on Bayesian methods.
Location:
Section 3.1.4, "Regularized least squares," pages 144-146
Other notes:
- Don't worry about deriving the closed-form solution in equation (3.28).
→ Machine Learning: a Probabilistic Perspective
A very comprehensive graduate-level machine learning textbook.
Location:
Section 7.5.1, pages 225-227
Other notes:
- Don't worry about deriving the closed-form solution in equation (7.33).
See also
- The closed-form solution to linear regression can be extended to ridge regression.
- Ridge regression is an example of regularization.
- Ridge regression can be viewed as a Bayesian model with a Gaussian prior over the parameters.
- The number of effective parameters in ridge regression is smaller than the actual number of parameters.
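The last point in the list above can be illustrated numerically. A standard definition of the effective number of parameters (effective degrees of freedom) for ridge regression is the trace of the matrix that maps targets to fitted values, which equals the sum of d_i^2 / (d_i^2 + lam) over the singular values d_i of the design matrix. The sketch below assumes that definition; the function name effective_dof and the random design matrix are illustrative only.

```python
import numpy as np

def effective_dof(X, lam):
    """Effective number of parameters: sum_i d_i^2 / (d_i^2 + lam) (hypothetical helper)."""
    # Singular values of the design matrix
    d = np.linalg.svd(X, compute_uv=False)
    return np.sum(d**2 / (d**2 + lam))

# Random full-rank design matrix with 5 features, purely for illustration
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))

# With lam = 0 the value equals the number of features;
# it decreases toward 0 as lam grows.
for lam in [0.0, 1.0, 10.0, 100.0]:
    print(lam, effective_dof(X, lam))
```

Each term in the sum lies between 0 and 1, so for any lam > 0 the total is strictly below the actual number of parameters.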