Deep learning

Created by: Roger Grosse
Intended for: machine learning practitioners

Probably the most accessible introduction to neural nets and deep learning is Geoff Hinton’s excellent Coursera course. Here you’ll learn about the key ideas and be able to implement simple versions of the algorithms yourself.

You can also check out one of several review papers, which give readable overviews of recent progress in the field.

But it’s one thing to learn the basics, and another to be able to get them to work well. The field isn’t at the point yet where you can just plug your data into the algorithm and have it work automatically. You’ll need to be able to diagnose problems: is the model overfitting? Is the optimization procedure getting stuck? Should you add more units? More layers? Unfortunately, there aren’t any recipes for these questions, and you’ll need to do a lot of tinkering and experimentation yourself. For this, you’ll need to have a deep understanding of the inner workings of the algorithms and how they relate to other key ideas in machine learning.

Working through the prerequisites in this roadmap should help you build this deep understanding.

Supervised models

If you’re interested in using neural nets, it’s likely that you’re interested in predicting something. Supervised learning is a machine learning framework where you have a particular task you’d like the computer to solve, and a training set where the correct predictions are labeled.

You should also be familiar with “shallow” machine learning algorithms, such as linear regression, logistic regression, and support vector machines. These are far easier to implement than neural nets, and good software packages exist for them. They serve as a sanity check for your neural net implementations: you should at least be able to beat these simple generic approaches. Plus, neural nets are built out of simple units which are closely related to these models, so taking the time to learn about them will give you a deeper understanding of neural nets.
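To make the "sanity check" idea concrete, here is a minimal sketch of a logistic regression baseline written with NumPy. The toy data (two Gaussian blobs), the function names, and the learning rate are all assumptions made for illustration; in practice you would probably reach for an existing package instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, learning_rate=0.5, num_steps=500):
    """Fit weights and bias by batch gradient descent on the mean cross-entropy."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(num_steps):
        p = sigmoid(X @ w + b)           # predicted probability of class 1
        w -= learning_rate * (X.T @ (p - y) / n)
        b -= learning_rate * np.mean(p - y)
    return w, b

# Hypothetical toy data: two overlapping Gaussian blobs, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = fit_logistic_regression(X, y)
baseline_accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print(f"logistic regression training accuracy: {baseline_accuracy:.2f}")
```

If a neural net trained on the same data can't match a number like this, the problem is more likely in your implementation or tuning than in the data. Note also that a single sigmoid output unit of a neural net computes exactly the function `sigmoid(X @ w + b)` used here.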

In order to have any hope of doing supervised learning, you need to understand the idea of generalization, the ability to make good predictions on novel examples. You’ll need to understand how to balance the tradeoff between underfitting and overfitting: you want your model to be expressive enough to model relevant aspects of the data, but not so complex that it “overfits” by modeling all the idiosyncrasies. In the case of regression, this can be formalized in terms of bias and variance, which provides a useful intuition more generally. You should be able to measure generalization performance using cross-validation.
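As a rough illustration of measuring generalization with cross-validation, the sketch below fits polynomials of different degrees to noisy samples of a sine curve and compares their held-out error. The data, the choice of degrees, and the helper names are assumptions for the sake of the example.

```python
import numpy as np

def polynomial_features(x, degree):
    """Columns are x**0, x**1, ..., x**degree."""
    return np.vander(x, degree + 1, increasing=True)

def k_fold_mse(x, y, degree, num_folds=5):
    """Average held-out mean squared error of a least-squares polynomial fit."""
    indices = np.arange(len(x))
    errors = []
    for held_out in np.array_split(indices, num_folds):
        train = np.setdiff1d(indices, held_out)
        coeffs, *_ = np.linalg.lstsq(
            polynomial_features(x[train], degree), y[train], rcond=None)
        predictions = polynomial_features(x[held_out], degree) @ coeffs
        errors.append(np.mean((predictions - y[held_out]) ** 2))
    return float(np.mean(errors))

# Hypothetical toy data: noisy samples of sin(3x). The inputs are drawn in
# random order, so contiguous folds behave like random folds.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=40)
y = np.sin(3.0 * x) + 0.1 * rng.normal(size=40)

cv_errors = {degree: k_fold_mse(x, y, degree) for degree in (1, 3, 15)}
for degree, error in cv_errors.items():
    print(f"degree {degree:2d}: cross-validated MSE = {error:.4f}")
```

On data like this you would typically expect the straight line (degree 1) to underfit, and a very high degree to fit the training folds closely while doing worse on the held-out points, though the exact numbers depend on the random draw.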

The vanilla deep learning model is the feed-forward neural net, which is trained with backpropagation. These ideas have been around for decades, but only recently have computers become fast enough for them to work well at scale.
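Here is a minimal sketch of a feed-forward net with one hidden layer, trained with backpropagation on the XOR problem. The architecture (8 tanh hidden units, a sigmoid output), the learning rate, and the number of steps are arbitrary choices for illustration, not recommendations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    W1, b1, W2, b2 = params
    hidden = np.tanh(X @ W1 + b1)          # hidden layer activations
    output = sigmoid(hidden @ W2 + b2)     # predicted probability of class 1
    return hidden, output

def loss_and_gradients(params, X, y):
    """Mean cross-entropy loss and its gradients, computed by backpropagation."""
    W1, b1, W2, b2 = params
    n = X.shape[0]
    hidden, output = forward(params, X)
    eps = 1e-12                            # guards against log(0)
    loss = -np.mean(y * np.log(output + eps) + (1 - y) * np.log(1 - output + eps))

    d_logits = (output - y) / n            # gradient w.r.t. the output pre-activation
    grad_W2 = hidden.T @ d_logits
    grad_b2 = d_logits.sum(axis=0)
    d_hidden = d_logits @ W2.T             # propagate the error back to the hidden layer
    d_pre = d_hidden * (1 - hidden ** 2)   # tanh'(z) = 1 - tanh(z)^2
    grad_W1 = X.T @ d_pre
    grad_b1 = d_pre.sum(axis=0)
    return loss, [grad_W1, grad_b1, grad_W2, grad_b2]

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [0.0]])

rng = np.random.default_rng(0)
params = [rng.normal(0, 1, (2, 8)), np.zeros(8), rng.normal(0, 1, (8, 1)), np.zeros(1)]

initial_loss, _ = loss_and_gradients(params, X, y)
for _ in range(2000):
    _, grads = loss_and_gradients(params, X, y)
    params = [p - 0.5 * g for p, g in zip(params, grads)]
final_loss, _ = loss_and_gradients(params, X, y)

print(f"loss before training: {initial_loss:.3f}, after: {final_loss:.3f}")
print("predictions:", forward(params, X)[1].ravel().round(2))
```

XOR is a handy test case because no linear model can solve it, so any success has to come from the hidden layer.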

Vision is one of the major application areas of deep learning, and convolutional nets have been applied here with tremendous success.
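The core operation in a convolutional net is sliding a small filter over an image and taking a dot product at each position. The sketch below is a deliberately slow, loop-based version meant only to show the idea; the image and the edge filter are made up for illustration, and real conv nets learn their filters and use much faster implementations.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """'Valid' 2-D cross-correlation, the operation conv nets usually call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the image patch under it.
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# A 6x6 image with a vertical edge: left half dark (0), right half bright (1).
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edge_filter = np.array([[-1.0, 1.0]])   # responds to left-to-right increases in intensity
response = conv2d_valid(image, edge_filter)
print(response)
```

The response is nonzero only in the column where the edge sits. Because the same filter weights are reused at every position, the layer has far fewer parameters than a fully connected layer over the same image.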

While the basic model formulations are fairly simple, the details of getting them to work in practice are not.

Unsupervised models

In supervised learning, you have data labeled with the correct predictions for a particular task. But in many cases, labeled data is hard to obtain, or the correct behavior is hard to define. All you have is a lot of unlabeled data. This is the setting known as unsupervised learning.

What can you do with unlabeled data? One thing you can do is simply look for patterns. Maybe your data is explainable in terms of a small number of underlying factors, or dimensions. This can be captured with principal component analysis or factor analysis. Or maybe you think the data are better explained in terms of clusters, where data points within a cluster are more similar than data points in different clusters. This can be captured with k-means or mixture of Gaussians.
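Both kinds of pattern-finding can be sketched in a few lines of NumPy. The example below runs PCA (via the singular value decomposition) and k-means on the same toy dataset of two well-separated blobs; the data, function names, and iteration count are assumptions made for illustration.

```python
import numpy as np

def pca(X, num_components):
    """Project centered data onto its top principal directions."""
    X_centered = X - X.mean(axis=0)
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:num_components]                    # rows are unit-length directions
    explained_variance = singular_values[:num_components] ** 2 / (len(X) - 1)
    return X_centered @ components.T, components, explained_variance

def k_means(X, num_clusters, num_iterations=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=num_clusters, replace=False)]
    for _ in range(num_iterations):
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        for k in range(num_clusters):
            if np.any(assignments == k):                # leave empty clusters where they are
                centers[k] = X[assignments == k].mean(axis=0)
    return centers, assignments

# Hypothetical toy data: two blobs in 3-D, separated along the (1, 1, 0) direction.
rng = np.random.default_rng(0)
blob_a = rng.normal(0.0, 0.5, size=(50, 3))
blob_b = rng.normal(0.0, 0.5, size=(50, 3)) + np.array([10.0, 10.0, 0.0])
X = np.vstack([blob_a, blob_b])

projected, components, explained_variance = pca(X, num_components=1)
centers, assignments = k_means(X, num_clusters=2)

print("first principal direction:", components[0].round(2))
print("cluster centers:\n", centers.round(2))
```

Here the two views agree: the first principal direction points along the line joining the blobs (up to sign), which is the single "underlying factor" explaining most of the variance, and k-means summarizes the same structure as two cluster centers.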

In the context of neural nets, there is another reason to care about unsupervised learning: it can often help you solve a supervised task better. In particular, unlabeled data is often much easier to obtain than labeled data. E.g., if you’re working on object recognition, labeling the objects in images is a laborious task, whereas unlabeled data includes the billions of images available on the Internet.

Generative pre-training has been shown to improve performance of supervised neural nets on a wide variety of tasks. The idea is that you start by training an unsupervised neural net on the unlabeled data (I’ll cover examples shortly), and then convert it to a supervised network with a similar architecture. As a result of having to model the data distribution, the network will be primed to pick up relevant structure. Also, for reasons that are still not very well understood, deep unsupervised models are often easier to train than deep supervised ones. One proposed explanation is that initializing from an unsupervised network helps the optimizer avoid poor local optima.

The evidence for generative pre-training is still mixed, and many of the most successful applications of deep neural nets have avoided it entirely, especially in the big data setting. But it has a good enough track record that it is worth being aware of.

So what are these unsupervised neural nets? The most basic one is probably the autoencoder, a feed-forward neural net that tries to predict its own input. While this isn’t exactly the world’s hardest prediction task, one makes it hard by somehow constraining the network. Often, this is done by introducing a bottleneck, where one or more of the hidden layers has much lower dimensionality than the inputs. Alternatively, one can constrain the hidden layer activations to be sparse (i.e. each unit activates only rarely), or feed the network corrupted versions of its inputs and make it reconstruct the clean ones (this is known as a denoising autoencoder).
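A minimal sketch of a bottleneck autoencoder follows: 5-dimensional inputs are squeezed through 2 tanh hidden units and reconstructed by a linear output layer. Everything here (the toy data, the absence of bias terms, the learning rate, the `noise_std` parameter) is an assumption chosen to keep the example short.

```python
import numpy as np

def train_autoencoder(X, num_hidden, noise_std=0.0, learning_rate=0.05,
                      num_steps=1000, seed=0):
    """Gradient descent on the mean squared reconstruction error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = 0.1 * rng.normal(size=(d, num_hidden))
    W_dec = 0.1 * rng.normal(size=(num_hidden, d))
    losses = []
    for _ in range(num_steps):
        # With noise_std > 0 the inputs are corrupted (the denoising variant).
        X_in = X + noise_std * rng.normal(size=X.shape)
        code = np.tanh(X_in @ W_enc)              # low-dimensional hidden code
        reconstruction = code @ W_dec
        error = reconstruction - X                # the target is always the clean input
        losses.append(np.mean(np.sum(error ** 2, axis=1)))

        d_recon = 2.0 * error / n
        grad_W_dec = code.T @ d_recon
        d_code = d_recon @ W_dec.T
        grad_W_enc = X_in.T @ (d_code * (1 - code ** 2))
        W_enc -= learning_rate * grad_W_enc
        W_dec -= learning_rate * grad_W_dec
    return W_enc, W_dec, losses

# Hypothetical toy data: 5-D points that really lie on a 2-D subspace.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = 0.3 * rng.normal(size=(2, 5))
X = latent @ mixing

W_enc, W_dec, losses = train_autoencoder(X, num_hidden=2)
print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Because the data have only two underlying degrees of freedom, a 2-unit bottleneck is enough to capture most of the structure. Passing a positive `noise_std` switches the same code to the denoising setup, where the network sees corrupted inputs but is scored against the clean ones.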

Another approach to unsupervised learning is known as generative modeling. Here, one assumes the data are drawn from some underlying distribution, and attempts to model the distribution. The restricted Boltzmann machine (RBM) is a simple generative neural network with a single hidden layer. RBMs can be stacked to form multilayer generative models, including deep belief nets (DBNs) and deep Boltzmann machines (DBMs). There is a wide variety of variations on this basic idea, many of which are covered below.

Generative modeling is a deep and rich area, and you can find lots more examples in the Bayesian machine learning roadmap.

Optimization algorithms

You’ve defined your neural net architecture. How the heck do you train it? The basic workhorse for neural net training is stochastic gradient descent (SGD), where one visits a single training example at a time (or a “minibatch” of training examples), and takes a small step to reduce the loss on those examples. This requires computing the gradient of the loss function, which can be done using backpropagation. Be sure to check your gradient computations with finite differences to make sure you’ve derived them correctly. SGD is conceptually simple and easy to implement, and with a bit of tuning, can work very well in practice.
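The sketch below puts these pieces together on the simplest possible model, linear regression: an analytic gradient, a finite-difference check of that gradient, and a minibatch SGD loop. The toy data, step size, and batch size are assumptions for illustration; for a real neural net the `gradient` function would be replaced by backpropagation.

```python
import numpy as np

def loss(w, X, y):
    """Half the mean squared error of a linear model."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def gradient(w, X, y):
    """Analytic gradient of the loss above."""
    return X.T @ (X @ w - y) / len(y)

def finite_difference_gradient(f, w, epsilon=1e-6):
    """Central-difference estimate of the gradient of f at w."""
    estimate = np.zeros_like(w)
    for i in range(len(w)):
        step = np.zeros_like(w)
        step[i] = epsilon
        estimate[i] = (f(w + step) - f(w - step)) / (2 * epsilon)
    return estimate

def sgd(X, y, learning_rate=0.05, batch_size=10, num_epochs=50, seed=0):
    """Minibatch SGD: shuffle each epoch, then step on one minibatch at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            batch = order[start:start + batch_size]
            w -= learning_rate * gradient(w, X[batch], y[batch])
    return w

# Hypothetical toy data generated from known weights.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(200, 3))
y = X @ true_w + 0.01 * rng.normal(size=200)

# Gradient check at a random point: the two gradients should agree closely.
w_check = rng.normal(size=3)
numeric_gradient = finite_difference_gradient(lambda w: loss(w, X, y), w_check)
max_gradient_error = np.max(np.abs(gradient(w_check, X, y) - numeric_gradient))
print(f"largest gradient discrepancy: {max_gradient_error:.2e}")

w_sgd = sgd(X, y)
print("weights found by SGD:", w_sgd.round(3))
```

The finite-difference check is too slow to use for training, since it needs two loss evaluations per parameter, but it is cheap insurance against a mis-derived gradient: run it once on a small model before trusting the fast version.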

There is a broad class of optimization problems known as convex optimization, where SGD and other local search algorithms are guaranteed to find the global optimum. Much of machine learning research is focused on trying to formulate things as convex optimization problems. Unfortunately, deep neural net training is usually not convex, so the best you can hope for is to find a local optimum. This is a bit disappointing, but ultimately it’s something we can live with. For most feed-forward networks and generative networks, the local optima tend to be pretty reasonable. (Recurrent neural nets are a different story; more on that below.)

A bigger problem than local optima is that the curvature of the loss function can be pretty extreme. While neural net training isn’t convex, the problem of curvature also shows up for convex problems, and many of the techniques for dealing with it are borrowed from convex optimization. As general background, it’s useful to read the following sections of Boyd and Vandenberghe’s book, Convex Optimization:

- gradient descent
- Newton’s method

While Newton’s method is very good at dealing with curvature, it is impractical for large-scale neural net training for two reasons. First, it is a batch method, so it requires visiting every training example in order to make a single step. Second, it requires constructing and inverting the Hessian matrix, whose dimension is the number of parameters. (Matrix inversion is only practical up to tens of thousands of parameters, whereas neural nets typically have millions.) Still, it serves as an “idealized” second-order training method which one can try to approximate. Practical algorithms for doing so include:

- conjugate gradient
- limited memory BFGS
- Hessian-free optimization
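To see why curvature matters, consider a toy quadratic whose curvature is 100 times larger in one direction than the other. The sketch below compares plain gradient descent with a single Newton step; the matrix `A`, the starting point, and the step size are assumptions picked to make the effect obvious.

```python
import numpy as np

# f(w) = 0.5 * w^T A w, with curvature 1 along the first axis and 100 along the second.
A = np.diag([1.0, 100.0])
w_start = np.array([1.0, 1.0])

def gradient(w):
    return A @ w

# Gradient descent: the step size has to stay below 2/100 to avoid diverging
# along the high-curvature axis, so progress along the low-curvature axis is slow.
w_gd = w_start.copy()
for _ in range(100):
    w_gd = w_gd - 0.01 * gradient(w_gd)

# Newton's method: rescale the gradient by the inverse Hessian (here the Hessian is A).
w_newton = w_start - np.linalg.solve(A, gradient(w_start))

print("gradient descent after 100 steps:", w_gd)
print("Newton's method after 1 step:    ", w_newton)
```

After 100 steps, gradient descent has shrunk the first coordinate only to 0.99**100, which is about 0.37, while Newton's method lands on the minimum at the origin in a single step. The catch is the call to `np.linalg.solve`, which is exactly the part that doesn't scale to millions of parameters and that the approximate methods listed above try to avoid.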

Relationships with other machine learning techniques

Neural nets share non-obvious relationships with a variety of algorithms from the rest of machine learning. Understanding these relationships will help you decide when particular architectural decisions are appropriate.

Many neural net models can be seen as nonlinear generalizations of "shallow" models. Feed-forward neural nets are essentially nonlinear analogues of algorithms like logistic regression. Autoencoders can be seen as nonlinear analogues of dimensionality reduction algorithms like PCA.

Kernel methods are another set of techniques for converting linear algorithms into nonlinear ones. There is actually a surprising relationship between neural nets and kernels: Bayesian neural nets converge to Gaussian processes (a kernelized regression model) in the limit of infinitely many hidden units.

Relationship with the brain

If these models are called "neural" nets, it's natural to ask whether they have anything to do with the brain. In a certain sense, they don't: you can understand and apply the algorithms without knowing anything about how the brain works. Mathematically, feed-forward neural nets are just adaptive basis function expansions.

But the connections between neural net models and the brain run very deep indeed. Unfortunately, Metacademy doesn't have any neuroscience content (yet!), so the background links in this section will be fairly incomplete.

When trying to draw parallels between learning algorithms and the brain, we need to be precise about what level we're talking about. David Marr argued for explicitly separating different levels of analysis: computation, algorithms, and implementation. While not all researchers agree with this way of partitioning things, it's useful to keep in mind when trying to understand exactly what someone is claiming.


Jeff Hawkins's book On Intelligence aims to present a unifying picture of the computational role of the neocortex. While the theory itself is fairly speculative, the book is an engaging and accessible introduction to the structure of the visual cortex.

Many neural net models have learned response properties similar to those of neurons in the primary visual cortex (V1). Olshausen and Field's sparse coding model was the first to demonstrate that a purely statistical learning algorithm could discover filters similar to those of V1. (Whether or not this is a neural net is a matter of opinion.) Since then, a wide variety of representation learning algorithms based on seemingly different ideas have recovered similar representations. Other statistical models have learned topological representations similar to the layout of cell types in V1. Karklin and Lewicki fit a more sophisticated statistical model which reproduced response properties of complex cells. While the connection between V1 and learned filters may seem tidy, Olshausen highlights a lot of things we still don't understand about V1.

For more on the neuroscience of the visual system, check out Eye, Brain, and Vision, a freely available book written by David Hubel, one of the pioneers who first studied V1.

There have also been neural nets explicitly proposed as models of the brain. Poggio's H-MAX model is a good example. DiCarlo showed that deep convolutional networks yield neurons which behave similarly to those high up in the primate visual hierarchy.

Feed-forward neural nets are not a particularly good model of the human visual system, where feedback connections play a significant role in processing. (The precise computational role of these connections is a major unsolved problem in neuroscience.) However, Poggio argues that feed-forward networks work as a model of the first ??? milliseconds of visual processing, where there isn't time for top-down signals to propagate.

Reinforcement learning algorithms are also commonly used to model phenomena from neuroscience.

Cognitive science

It's not just at the level of neurons that researchers have tried to draw connections between the brain and neural nets. Cognitive science refers to the interdisciplinary study of thought processes, and can be thought of as the study of the mind rather than the brain. Connectionism was a branch of cognitive science, especially influential during the 1980s, which attempted to model high-level cognitive processes in terms of networks of neuron-like units. (Several of the most influential machine learning researchers came out of this tradition.)

McClelland and Rumelhart's book Parallel Distributed Processing is the Bible of connectionism. Some significant works in the field include:

Given this large body of work, one can ask, how strong is the evidence that these models are correct? This is a very subtle question when it comes to cognitive modeling. Connectionist models are not intended to be implementational descriptions (in Marr's sense); rather, the units and operations are meant to be representative of the sorts of computations that can be performed in a neural architecture. Also, the same phenomenon can often be explained at multiple levels, and Bayesian and connectionist interpretations aren't incompatible. Rather, one hopes to obtain models which yield increasingly accurate and insightful explanations of phenomena.