cross validation
(1.4 hours to learn)
Summary
Cross validation is the process of partitioning an observed dataset into a training set and a testing set, performing the statistical analysis on the training set (e.g. learning the parameters of a distribution used to describe the data), and then validating the result on the testing set (e.g. measuring how well the learned distribution describes the testing set). After the validation step, a previously untested portion of the dataset becomes the new testing set, and the previous testing set is folded back into the training set. This cycle repeats until all of the data has been tested. The procedure is used to estimate how well a statistical analysis generalizes to new data.
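The cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the helper `train_and_score` is a hypothetical callback assumed to fit a model on the training portion and return its loss on the testing portion.

```python
import random

def k_fold_cross_validation(data, k, train_and_score):
    """Estimate generalization performance by K-fold cross validation.

    Each of the k folds serves once as the testing set while the
    remaining k-1 folds form the training set. `train_and_score` is a
    hypothetical helper: train_and_score(train, test) -> loss.
    """
    data = list(data)
    random.shuffle(data)                      # avoid order effects
    folds = [data[i::k] for i in range(k)]    # k disjoint folds
    scores = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_score(train, test))
    # the average loss over folds is the cross-validation estimate
    return sum(scores) / k
```

Every data point appears in exactly one testing set, so the whole dataset contributes to the final estimate, at the cost of running the analysis k times.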
Context
This concept has the prerequisites:
- loss function (the performance on the training/testing set is typically measured using a loss function)
Goals
- Understand the common types of cross validation: the holdout method, random subsampling, K-fold, and leave-one-out
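The variants above differ mainly in how the testing sets are chosen. As an illustrative sketch (the index-striping scheme here is one simple choice, not the only one), the testing-set indices for K-fold and leave-one-out can be enumerated directly; leave-one-out is just K-fold with k equal to the dataset size, while the holdout method performs only a single train/test split.

```python
def kfold_test_indices(n, k):
    """Indices of the testing set for each of the k rounds of K-fold
    cross validation over n data points (simple striped assignment)."""
    return [list(range(i, n, k)) for i in range(k)]

n = 6
print(kfold_test_indices(n, 3))  # 3-fold: each index tested exactly once
print(kfold_test_indices(n, n))  # leave-one-out: one point per testing set
```

Running this shows that 3-fold testing sets partition the six indices into three pairs, while leave-one-out produces six singleton testing sets.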
Core resources (read/watch one of the following)
-Free-
→ Wikipedia
→ Mathematical Monk: Machine Learning (2011)
Online videos on machine learning.
Other notes:
- discusses cross validation from a decision-theoretic perspective
Supplemental resources (the following are optional, but you may find them useful)
-Free-
→ djweissML Video Lectures
See also
- Cross-validation is a common technique for estimating how well a statistical model generalizes, using every data point in the dataset for both training and testing