Bayesian parameter estimation
(1.8 hours to learn)
In the Bayesian framework, we treat the parameters of a statistical model as random variables. The model is specified by a prior distribution over the values of the parameters, as well as an evidence model (the likelihood) which determines how the parameters influence the observed data. When we condition on the observations, we get the posterior distribution over parameters. The term "Bayesian parameter estimation" is somewhat misleading, because often we can skip the parameter estimation step entirely: rather than estimating the parameters, we integrate them out and directly make predictions about future observables.
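To make the last point concrete, here is a small worked example in the beta-Bernoulli model mentioned in the goals below (the symbols a, b, N_1, and N_0 are illustrative notation rather than anything taken from the listed resources). With a Beta(a, b) prior on a Bernoulli parameter \theta and data \mathcal{D} consisting of N_1 ones and N_0 zeros, the posterior is

p(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid a + N_1, \; b + N_0),

and integrating \theta out gives the posterior predictive probability of the next observation directly, without ever committing to a parameter estimate:

p(x_{\mathrm{new}} = 1 \mid \mathcal{D}) = \int_0^1 \theta \, p(\theta \mid \mathcal{D}) \, d\theta = \frac{a + N_1}{a + b + N_1 + N_0}.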
This concept has the prerequisites:
- Bayes' rule (Bayes' rule is an important conceptual component of Bayesian parameter estimation.)
- beta distribution (The beta-Bernoulli distribution is an instructive example of Bayesian parameter estimation.)
- conditional distributions (In Bayesian parameter estimation, we need to reason with the conditional distributions over parameters.)
- multivariate distributions (In Bayesian parameter estimation, we need to marginalize out the parameters in order to make predictions.)
- expectation and variance (A common use of the predictive distribution is computing expectations.)
This concept has the following goals:
- Know what the terms "prior" and "likelihood function" refer to
- Be able to compute the posterior distribution using Bayes' rule
- Know what the posterior predictive distribution is and how to compute it analytically for a simple example (e.g. a beta-Bernoulli model)
- What is a conjugate prior, and why is it useful?
- Why can the posterior distribution be given in terms of pseudocounts when a conjugate prior is used?
- What is the maximum a posteriori (MAP) approximation? Give an example where the predictions differ between the MAP parameters and the posterior predictive distribution.
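The last goal can be checked with a few lines of arithmetic. Below is a minimal Python sketch (the prior, data, and variable names are hypothetical and not drawn from any of the resources that follow): with a uniform Beta(1, 1) prior and three observed tails, the MAP estimate claims heads are impossible, while the posterior predictive still assigns them probability 0.2.

# Hypothetical beta-Bernoulli example: Beta(a, b) prior over the probability of heads.
a, b = 1.0, 1.0                      # uniform prior, i.e. pseudocounts of one head and one tail
data = [0, 0, 0]                     # three tails observed, no heads
n1 = sum(data)                       # observed heads
n0 = len(data) - n1                  # observed tails

# Conjugacy: the posterior is Beta(a + n1, b + n0) -- the prior pseudocounts plus the data counts.
post_a, post_b = a + n1, b + n0

# MAP approximation: plug in the posterior mode as a point estimate of theta
# (the mode formula below is valid here since post_a >= 1 and post_b > 1).
theta_map = (post_a - 1) / (post_a + post_b - 2)

# Posterior predictive: integrate theta out; for this model it equals the posterior mean.
p_heads = post_a / (post_a + post_b)

print(theta_map)   # 0.0 -> the MAP parameters say heads can never occur
print(p_heads)     # 0.2 -> the posterior predictive keeps heads possible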
Core resources (read/watch one of the following)
→ Bayesian Reasoning and Machine Learning
A textbook for a graduate machine learning course.
→ Coursera: Probabilistic Graphical Models (2013)
An online course on probabilistic graphical models.
- Additional dependency: maximum likelihood
- If you're not familiar with Bayes nets, don't worry: most of these lectures don't depend on them.
- Click on "Preview" to see the videos.
→ Probabilistic Graphical Models: Principles and Techniques
A very comprehensive textbook for a graduate-level course on probabilistic AI.
Location: Section 17.3, pgs. 733-741
→ Machine Learning: a Probabilistic Perspective
A very comprehensive graduate-level machine learning textbook.
Location: Sections 3.1-3.3, pgs. 65-78
Supplemental resources (the following are optional, but you may find them useful)
→ Coursera: Neural Networks for Machine Learning (2012)
An online course by Geoff Hinton, who invented many of the core ideas behind neural nets and deep learning.
Location: Lecture, "Introduction to the full Bayesian approach"
→ Mathematical Statistics and Data Analysis
An undergraduate statistics textbook.
Location: Section 8.6, "The Bayesian approach to parameter estimation," up through 8.6.1, "Further remarks on priors," pages 285-296
→ Bayesian Data Analysis
A Bayesian statistics textbook with a focus on practical issues.
Location: Sections 2.1-2.3, pgs. 33-39
→ Pattern Recognition and Machine Learning
A textbook for a graduate machine learning course, with a focus on Bayesian methods.
Location: Section 2.1, pgs. 68-74
- Some other parameter estimation methods include maximum likelihood estimation.
- Bayesian decision theory concerns how to act based on our inferences from the data.
- Usually, the posterior over parameters and the predictive distribution can't be computed in closed form. Here are some strategies for getting approximate solutions:
  - maximum a posteriori (MAP) estimation, i.e. finding the most likely parameters
  - variational Bayes, a framework for approximating intractable posterior distributions with tractable ones
  - Markov chain Monte Carlo (MCMC), a set of techniques for approximately sampling from the posterior (a minimal sampler sketch appears after this list)
- While Bayesian parameter estimation attenuates overfitting, it doesn't solve the problem completely; much depends on how the prior is specified. Some common strategies for choosing priors include:
  - conjugate priors, where the prior and posterior have the same functional form
  - uninformative priors, which try to say as little as possible about the parameter
  - Bayesian model averaging, where we average over multiple choices of the prior
  - hierarchical models, where the prior for one problem comes from information obtained from related problems
  - eliciting priors from human experts
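To give some intuition for the MCMC option mentioned above, here is a minimal random-walk Metropolis sketch in Python. Everything in it (the data counts, step size, and number of steps) is made up for illustration, and for this conjugate beta-Bernoulli posterior the exact answer is available in closed form, so the sampler serves only as a sanity check.

import math
import random

def log_post(theta, n1, n0, a=1.0, b=1.0):
    # Unnormalized log posterior of a beta-Bernoulli model: Beta(a, b) prior, n1 heads, n0 tails.
    if theta <= 0.0 or theta >= 1.0:
        return float("-inf")
    return (a + n1 - 1) * math.log(theta) + (b + n0 - 1) * math.log(1.0 - theta)

def metropolis(n1, n0, steps=20000, step_size=0.1):
    theta, samples = 0.5, []
    for _ in range(steps):
        proposal = theta + random.gauss(0.0, step_size)        # symmetric random-walk proposal
        log_accept = log_post(proposal, n1, n0) - log_post(theta, n1, n0)
        if random.random() < math.exp(min(0.0, log_accept)):   # Metropolis acceptance rule
            theta = proposal
        samples.append(theta)
    return samples[steps // 2:]                                # crude burn-in: drop the first half

samples = metropolis(n1=7, n0=3)
print(sum(samples) / len(samples))   # should be near the exact posterior mean 8/12 ~= 0.667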