Stanford CS229: Machine Learning

Created by: Roger Grosse
Intended for: CS229 students, anyone interested in machine learning

CS229 is Stanford's graduate course in machine learning, currently taught by Andrew Ng. It provides an overview of techniques for supervised, unsupervised, and reinforcement learning, as well as some results from computational learning theory. It's a very popular course, with hundreds of students every year.

This outline is based on how the course was taught in 2006. Here is the course web page, with a set of lecture notes. You can also find all the lecture videos here. Professor Ng also taught a more famous machine learning course on Coursera, which is a much simplified version of this course with most of the math removed.

Weeks 1-4: supervised learning

Supervised learning refers to the setting where you have a well-defined task you would like a machine to perform, and you have a labeled training dataset consisting of input/output pairs. This is the best-understood machine learning setting, and there are some powerful general-purpose algorithms which perform well across a wide range of problems. The first part of the course covers some of these algorithms, as well as some principles that guide you in how to apply them.

The course starts off with linear regression, one of the best-understood machine learning methods. In the most basic case, one wishes to predict a scalar-valued variable (e.g. housing price) as a linear function of various inputs (e.g. area), and the performance measure is the squared error in predicting the target. This problem has a closed-form solution (derived using linear algebra), or alternatively, can be solved using gradient descent. The assumption of a linear function is a strong constraint, but basis function expansions give a way to fit nonlinear models using a linear learning algorithm.

The linear regression algorithm can be generalized to other kinds of targets, although typically these don't have closed-form solutions. The case where the targets are binary-valued is known as binary classification. CS229 covers two binary classification algorithms: the perceptron, and logistic regression. While logistic regression doesn't have a closed form solution, it can be solved efficiently with iterative reweighted least squares (IRLS), a special case of Newton's method where each of the updates has a form very similar to linear regression. Generalized linear models (GLMs) are a more general class of regression models which can handle many different kinds of targets; the targets are modeled as an exponential family distribution whose parameters are a linear function of the inputs.

Linear regression, logistic regression, and GLMs are all examples of discriminative models, where one directly models the mapping from inputs to targets. Another approach is generative modeling, where one models the joint distribution over inputs and targets. Gaussian discriminant analysis is a generative classification algorithm which fits a multivariate Gaussian distribution to the (real-valued) inputs for each value of the target. Naive Bayes is an analogous model for the case where the inputs are binary.

The course then talks about support vector machines (SVMs), one of the most successful black-box classification algorithms. The SVM can be thought of as an algorithm which aims to find the linear decision boundary which separates the two classes with the largest margin. If the classes aren't separable, one can allow (but penalize) violations of the margin constraint, resulting in the soft-margin SVM. The optimality conditions are derived based on the KKT conditions (from the theory of convex optimization), leading to an efficient algorithm known as sequential minimal optimization. What makes SVMs powerful is that they can be kernelized: the data are mapped into a high-dimensional space, such that a linear decision boundary in this space can correspond to a complex, nonlinear one in the original space. The kernel trick allows this mapping to be done implicitly, without ever having to explicitly represent points in the high-dimensional space.

Weeks 5-6: Learning theory and regularization

Usually we're not just interested in making predictions on the training data; we'd like our algorithms to generalize to new examples. The second part of the course talks about how to get models to generalize well, both in theory and in practice.

In the case of linear regression, under certain simplifying assumptions, the prediction error can be decomposed as a sum of three terms:

  1. bias, the systematic error resulting from having too simple a model class
  2. variance. error resulting from random variability in the training data
  3. Bayes error, the part of the error which is unavoidable, since it reflects inherent variability in the targets

This is known as the bias-variance decomposition. If the algorithm suffers from bias, it is said to underfit, whereas if it suffers from variance, it is said to overfit. While the bias-variance decomposition doesn't hold exactly outside the squared error setting, we still loosely use bias and variance as synonyms for underfitting and overfitting.

Probably approximately correct (PAC) learning is a theoretical framework for analyzing generalization. If we assume a discrete hypothesis class, and training and test samples drawn i.i.d. from the same distribution, we can show that the best hypothesis on the training set probably generalizes almost as well as the true best hypothesis (where the gap depends on the size of the hypothesis class and the number of training examples). The Vapnik-Chervonenkis (VC) dimension gives an intrinsic complexity measure for continuous hypothesis spaces, thereby letting us generalize PAC learning results to continuous spaces.

It's pretty hard to apply these theoretical results in practice. However, there are some very effective practical techniques for balancing bias and variance. Two techniques for reducing the effective complexity of a model class are feature selection and regularization (see, e.g., ridge regression). Both of these approaches typically require specifing one or more hyperparameters, but these can be chosen by minimizing error on a held-out validation set.

Weeks 7-8: unsupervised learning

The first two sections focused on supervised learning, where one has a well-defined task, and a set of labeled input/output pairs. In unsupervised learning, by contrast, one simply has lots of unlabeled data, and the goal is to find interesting patterns in it. Unsupervised learning is much less well understood than supervised learning, and getting useful results often requires building in considerable problem-specific structure.

One of the canonical unsupervised learning tasks is clustering, where the goal is to partition the data points into groups, such that the points within a group are similar, and the points in different groups are dissimilar. A simple (but often pretty effective) clustering algorithm is k-means, which alternates between computing the center (mean) of each cluster and assigning each point to the cluster with the nearest center. One can also take a probabilistic approach by fitting a mixture of Gaussians model. The maximum likelihood solution (or at least a local maximum) can be computed using the expectation-maximization (EM) algorithm. While the EM procedure is somewhat analogous to k-means in the case of Gaussian mixtures, it is a much more general algorithm and can be used to fit a wide variety of probabilistic models.

Principal component analysis is a different kind of unsupervised learning algorithm, where one doesn't want a set of discrete categories, but instead a small number of continuous factors underlying the data. PCA isn't explicitly a probabilistic model, but there are probabilistic variants of it. Factor analysis is a probabilistic model with continuous latent factors, and differs from PCA in that it fits separate noise parameters for each of the inputs. Like the mixture of Gaussians model, it can be fit using EM.

Independent component analysis is a very different kind of latent factor model from PCA. While PCA basically tries to find the subspace of highest variability, ICA is more about trying to uncover the underlying components themselves. It's provably impossible to recover the individual components if they are Gaussian distributed (the assumption made by factor analysis and probabilistic PCA), but if they have some other distribution (typically a heavy-tailed one), their contributions can be separated out. ICA can yield more interpretable components than PCA, and has been pretty successful at disentangling different factors in EEG signals.

Weeks 9-10: reinforcement learning

The final learning paradigm considered in CS229 is reinforcement learning. This sits somehwere in between supervised and unsupervised learning: the learning agent isn't given examples of the correct behavior, but it is given a reward signal which measures the goodness of its behaviors.

A model often assumed in the reinforcement learning setting is the Markov decision process (MDP). In an MDP, all the relevant information is encapsulated in the state, and in each time step, the agent can choose one of several actions which affect how it (stochastically) transitions to other states. In each time step, the agent receives a reward depending both on the current state and on its action, and its goal is to maximize a (possibly discounted) sum of the rewards over all time steps. Importantly, the dynamics are Markovian, in that the rewards and transitions depend only on the current state, and not on the past history of the system.

The goal is to learn a good policy, or mapping from states to actions. If all the transition probabilities and reward distributions are known, the optimal policy can be characterized using the Bellman equations. These equations are usually very hard to solve exactly, but they can be solved approximately using value iteration or policy iteration. If the number of steps is finite, the solution can be computed exactly using dynamic programming (although doing so might be very expensive).

If the dynamics or reward function is unknown, temporal difference learning is one approach to learning a good policy. In unknown environments, agents must balance the exploration/exploitation tradeoff: would they rather pay a high cost now to learn about the environment (exploration), or continue playing the best strategy given their current knowledge, even if it might be suboptimal (exploitation)? A simple model of the exploration/exploitation tradeoff is the multi-armed bandit.