CS229 is Stanford's graduate course in machine learning, currently taught by Andrew Ng. It provides an overview of techniques for supervised, unsupervised, and reinforcement learning, as well as some results from computational learning theory. It's a very popular course, with hundreds of students every year. 
  
  This outline is based on how the course was taught in 2006. [Here](http://cs229.stanford.edu/) is the course web page, with a set of [lecture notes](http://cs229.stanford.edu/materials.html). You can also find all the lecture videos [here](http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1). Professor Ng also taught a more famous [machine learning course](https://www.coursera.org/course/ml) on Coursera, which is a much simplified version of this course with most of the math removed.
  
  
  ## Weeks 1-4: supervised learning
  
  [Supervised learning](http://en.wikipedia.org/wiki/Supervised_learning) refers to the setting where you have a well-defined task you would like a machine to perform, and you have a labeled training dataset consisting of input/output pairs. This is the best-understood machine learning setting, and there are some powerful general-purpose algorithms which perform well across a wide range of problems. The first part of the course covers some of these algorithms, as well as some principles that guide you in how to apply them.
  
  The course starts off with [linear regression](linear_regression), one of the best-understood machine learning methods. In the most basic case, one wishes to predict a scalar-valued variable (e.g. housing price) as a linear function of various inputs (e.g. area), and the performance measure is the squared error in predicting the target. This problem has a [closed-form solution](linear_regression_closed_form) (derived using linear algebra), or alternatively, can be solved using [gradient descent](gradient_descent). The assumption of a linear function is a strong constraint, but [basis function expansions](basis_function_expansions) give a way to fit nonlinear models using a linear learning algorithm.
  
  The linear regression algorithm can be generalized to other kinds of targets, although typically these don't have closed-form solutions. The case where the targets are binary-valued is known as [binary classification](binary_linear_classifiers). CS229 covers two binary classification algorithms: the [perceptron](perceptron), and [logistic regression](logistic_regression). While logistic regression doesn't have a closed form solution, it can be solved efficiently with [iterative reweighted least squares (IRLS)](logistic_regression_irls), a special case of [Newton's method](newtons_method_optimization) where each of the updates has a form very similar to linear regression. [Generalized linear models (GLMs)](generalized_linear_models) are a more general class of regression models which can handle many different kinds of targets; the targets are modeled as an [exponential family](exponential_families) distribution whose parameters are a linear function of the inputs.
  
  Linear regression, logistic regression, and GLMs are all examples of [discriminative models](http://en.wikipedia.org/wiki/Discriminative_model), where one directly models the mapping from inputs to targets. Another approach is [generative modeling](http://en.wikipedia.org/wiki/Generative_model), where one models the joint distribution over inputs and targets. [Gaussian discriminant analysis](gaussian_discriminant_analysis) is a generative classification algorithm which fits a [multivariate Gaussian distribution](multivariate_gaussian_distribution) to the (real-valued) inputs for each value of the target. [Naive Bayes](naive_bayes) is an analogous model for the case where the inputs are binary.
  
  The course then talks about [support vector machines (SVMs)](support_vector_machine), one of the most successful black-box classification algorithms. The SVM can be thought of as an algorithm which aims to find the linear decision boundary which separates the two classes with the largest margin. If the classes aren't separable, one can allow (but penalize) violations of the margin constraint, resulting in the [soft-margin SVM](soft_margin_svm). The [optimality conditions](svm_optimality_conditions) are derived based on the [KKT conditions](kkt_conditions) (from the theory of [convex optimization](convex_optimization)), leading to an efficient algorithm known as [sequential minimal optimization](sequential_minimal_optimization). What makes SVMs powerful is that they can be [kernelized](kernel_svm): the data are mapped into a high-dimensional space, such that a linear decision boundary in this space can correspond to a complex, nonlinear one in the original space. The [kernel trick](kernel_trick) allows this mapping to be done implicitly, without ever having to explicitly represent points in the high-dimensional space.
  
  
- ## Week 5: Learning theory and regularization
+ ## Weeks 5-6: Learning theory and regularization
  
  Usually we're not just interested in making predictions on the training data; we'd like our algorithms to [generalize](generalization) to new examples. The second part of the course talks about how to get models to generalize well, both in theory and in practice.
  
  In the case of linear regression, under certain simplifying assumptions, the prediction error can be decomposed as a sum of three terms: 
  
  1. bias, the systematic error resulting from having too simple a model class
  2. variance. error resulting from random variability in the training data
  3. Bayes error, the part of the error which is unavoidable, since it reflects inherent variability in the targets
  
  This is known as the [bias-variance decomposition](bias_variance_decomposition). If the algorithm suffers from bias, it is said to underfit, whereas if it suffers from variance, it is said to overfit. While the bias-variance decomposition doesn't hold exactly outside the squared error setting, we still loosely use bias and variance as synonyms for underfitting and overfitting.
  
- [Probably approximately correct (PAC) learning][pac_learning] is a theoretical framework for analyzing generalization. If we assume a discrete hypothesis class, and training and test samples drawn i.i.d. from the same distribution, we can show that the best hypothesis on the training set probably generalizes almost as well as the true best hypothesis (where the gap depends on the size of the hypothesis class and the number of training examples). The [Vapnik-Chervonenkis (VC) dimension](vc_dimension) gives an intrinsic complexity measure for continuous hypothesis spaces, thereby letting us generalize PAC learning results to continuous spaces.
+ [Probably approximately correct (PAC) learning](pac_learning) is a theoretical framework for analyzing generalization. If we assume a discrete hypothesis class, and training and test samples drawn i.i.d. from the same distribution, we can show that the best hypothesis on the training set probably generalizes almost as well as the true best hypothesis (where the gap depends on the size of the hypothesis class and the number of training examples). The [Vapnik-Chervonenkis (VC) dimension](vc_dimension) gives an intrinsic complexity measure for continuous hypothesis spaces, thereby letting us generalize PAC learning results to continuous spaces.
  
  It's pretty hard to apply these theoretical results in practice. However, there are some very effective practical techniques for balancing bias and variance. Two techniques for reducing the effective complexity of a model class are [feature selection](http://en.wikipedia.org/wiki/Feature_selection) and [regularization](http://en.wikipedia.org/wiki/Regularization_(mathematics)) (see, e.g., [ridge regression](ridge_regression)). Both of these approaches typically require specifing one or more hyperparameters, but these can be chosen by minimizing error on a held-out [validation set](cross_validation).