cross validation

(1.4 hours to learn)


Cross validation is the process of partitioning an observed dataset into a training dataset and a testing dataset, performing the statistical analysis on the training dataset (e.g. learning the parameters of a distribution used to describe the data), and then validating the result on the testing dataset (e.g. measuring how well the learned distribution describes the testing dataset). Following the validation step, a new, untested portion of the training dataset becomes the testing dataset and the previous testing dataset is incorporated into the training dataset. This cycle repeats until every data point has been used for testing exactly once. The process is used to estimate how well a statistical analysis generalizes to new data.
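The cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes the "statistical analysis" is fitting a one-dimensional Gaussian by maximum likelihood and that validation means the average log-likelihood of the held-out points. The function names, the synthetic data, and the choice of 5 folds are all assumptions made for the example.

```python
import math
import random


def fit_gaussian(xs):
    """Learn the mean and variance of a 1-D Gaussian by maximum likelihood."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return mean, max(var, 1e-12)  # guard against a zero variance


def avg_log_likelihood(xs, mean, var):
    """Average log density of the points xs under N(mean, var)."""
    total = sum(
        -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
        for x in xs
    )
    return total / len(xs)


def k_fold_cross_validate(data, k=5, seed=0):
    """Return one held-out score per fold; each point is tested exactly once."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th point starting at offset i, so folds are disjoint.
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        # All other folds together form the training dataset for this round.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        mean, var = fit_gaussian(train)
        scores.append(avg_log_likelihood(test, mean, var))
    return scores


# Synthetic data (an assumption for the example): 200 draws from N(3, 2^2).
rng = random.Random(42)
data = [rng.gauss(3.0, 2.0) for _ in range(200)]
scores = k_fold_cross_validate(data, k=5)
cv_estimate = sum(scores) / len(scores)
```

Averaging the per-fold scores gives a single estimate of how well the fitted distribution describes unseen data. The spread of the individual scores gives a rough sense of how much that estimate depends on the particular split.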


This concept has the prerequisites:


  • Understand the common types of cross validation: holdout method, random subsamples, K-fold, leave-one-out
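As a rough sketch of how the four variants listed above differ, the Python below generates the (train, test) index splits for each one. The function names, default test fraction, and number of rounds are hypothetical choices for illustration. The K-fold splitter here does not shuffle, so in practice the data would usually be shuffled first.

```python
import random


def holdout_split(n, test_fraction=0.25, seed=0):
    """Holdout: a single shuffled train/test partition."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(n * test_fraction)))
    return [(idx[n_test:], idx[:n_test])]


def random_subsample_splits(n, rounds=3, test_fraction=0.25, seed=0):
    """Random subsamples: repeated holdout; test sets may overlap."""
    return [holdout_split(n, test_fraction, seed + r)[0] for r in range(rounds)]


def k_fold_splits(n, k):
    """K-fold: k disjoint test folds that together cover every index."""
    idx = list(range(n))
    folds = [idx[i::k] for i in range(k)]
    return [
        ([j for f, fold in enumerate(folds) if f != i for j in fold], folds[i])
        for i in range(k)
    ]


def leave_one_out_splits(n):
    """Leave-one-out: K-fold with k equal to the number of data points."""
    return k_fold_splits(n, n)


n = 8
holdout = holdout_split(n)
subsamples = random_subsample_splits(n, rounds=3)
kfold = k_fold_splits(n, k=4)
loo = leave_one_out_splits(n)
```

Holdout and random subsampling do not guarantee that every point is tested. K-fold and leave-one-out test each point exactly once, and leave-one-out does so at the cost of one model fit per data point.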

Core resources (read/watch one of the following)


Mathematical Monk: Machine Learning (2011)
Online videos on machine learning.
Other notes:
  • discusses cross validation from a decision-theoretic perspective

Supplemental resources (the following are optional, but you may find them useful)


djweissML Video Lectures

See also

  • Cross-validation is a common technique for measuring how well a statistical model generalizes, using every data point in the dataset for testing