Bayesian parameter estimation
(1.8 hours to learn)
In the Bayesian framework, we treat the parameters of a statistical model as random variables. The model is specified by a prior distribution over the values of the parameters, as well as an evidence model (the likelihood) which determines how the parameters influence the observed data. When we condition on the observations, we get the posterior distribution over parameters. The term "Bayesian parameter estimation" is somewhat misleading, because often we can skip the parameter estimation step entirely: rather than estimating the parameters, we integrate them out and directly make predictions about future observables.
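To make the last point concrete, here is a small worked example in the beta-Bernoulli model mentioned in the goals below (the symbols a, b, N_1, and N_0 are illustrative notation rather than anything taken from the listed resources). With a Beta(a, b) prior on a Bernoulli parameter \theta and data \mathcal{D} consisting of N_1 ones and N_0 zeros, the posterior is

p(\theta \mid \mathcal{D}) = \mathrm{Beta}(\theta \mid a + N_1, \; b + N_0),

and integrating \theta out gives the posterior predictive probability of the next observation directly, without ever committing to a parameter estimate:

p(x_{\mathrm{new}} = 1 \mid \mathcal{D}) = \int_0^1 \theta \, p(\theta \mid \mathcal{D}) \, d\theta = \frac{a + N_1}{a + b + N_1 + N_0}.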
This concept has the prerequisites:
- Bayes' rule (Bayes' rule is an important conceptual component of Bayesian parameter estimation.)
- beta distribution (The beta-Bernoulli distribution is an instructive example of Bayesian parameter estimation.)
- conditional distributions (In Bayesian parameter estimation, we need to reason with the conditional distributions over parameters.)
- multivariate distributions (In Bayesian parameter estimation, we need to marginalize out the parameters in order to make predictions.)
- expectation and variance (A common use of the predictive distribution is computing expectations.)
This concept has the following goals:
- Know what the terms "prior" and "likelihood function" refer to
- Be able to compute the posterior distribution using Bayes' rule
- Know what the posterior predictive distribution is and how to compute it analytically for a simple example (e.g. a beta-Bernoulli model)
- What is a conjugate prior, and why is it useful?
- Why can the posterior distribution be given in terms of pseudocounts when a conjugate prior is used?
- What is the maximum a posteriori (MAP) approximation? Give an example where the predictions differ between the MAP parameters and the posterior predictive distribution.
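The last goal can be checked with a few lines of arithmetic. Below is a minimal Python sketch (the prior, data, and variable names are hypothetical and not drawn from any of the resources that follow): with a uniform Beta(1, 1) prior and three observed tails, the MAP estimate claims heads are impossible, while the posterior predictive still assigns them probability 0.2.

# Hypothetical beta-Bernoulli example: Beta(a, b) prior over the probability of heads.
a, b = 1.0, 1.0                      # uniform prior, i.e. pseudocounts of one head and one tail
data = [0, 0, 0]                     # three tails observed, no heads
n1 = sum(data)                       # observed heads
n0 = len(data) - n1                  # observed tails

# Conjugacy: the posterior is Beta(a + n1, b + n0) -- the prior pseudocounts plus the data counts.
post_a, post_b = a + n1, b + n0

# MAP approximation: plug in the posterior mode as a point estimate of theta
# (the mode formula below is valid here since post_a >= 1 and post_b > 1).
theta_map = (post_a - 1) / (post_a + post_b - 2)

# Posterior predictive: integrate theta out; for this model it equals the posterior mean.
p_heads = post_a / (post_a + post_b)

print(theta_map)   # 0.0 -> the MAP parameters say heads can never occur
print(p_heads)     # 0.2 -> the posterior predictive keeps heads possible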
Core resources (read/watch one of the following)
→ Bayesian Reasoning and Machine Learning
A textbook for a graduate machine learning course.
→ Coursera: Probabilistic Graphical Models (2013)
An online course on probabilistic graphical models.
- Additional dependency: maximum likelihood
- If you're not familiar with Bayes nets, don't worry: most of these lectures don't depend on them.
- Click on "Preview" to see the videos.
→ Probabilistic Graphical Models: Principles and Techniques
A very comprehensive textbook for a graduate-level course on probabilistic AI.
Location: Section 17.3, pgs. 733-741
→ Machine Learning: a Probabilistic Perspective
A very comprehensive graduate-level machine learning textbook.
Location: Sections 3.1-3.3, pgs. 65-78
Supplemental resources (the following are optional, but you may find them useful)
→ Coursera: Neural Networks for Machine Learning (2012)
An online course by Geoff Hinton, who invented many of the core ideas behind neural nets and deep learning.
Location: Lecture, "Introduction to the full Bayesian approach"
→ Mathematical Statistics and Data Analysis
An undergraduate statistics textbook.
Location: Section 8.6, "The Bayesian approach to parameter estimation," up through 8.6.1, "Further remarks on priors," pages 285-296
→ Bayesian Data Analysis
A Bayesian statistics textbook with a focus on practical issues.
Location: Sections 2.1-2.3, pgs. 33-39
→ Pattern Recognition and Machine Learning
A textbook for a graduate machine learning course, with a focus on Bayesian methods.
Location: Section 2.1, pgs. 68-74
- Some other parameter estimation methods include maximum likelihood estimation.
- Bayesian decision theory concerns how to act based on our inferences from the data.
- Usually, the posterior over parameters and the predictive distribution can't be computed in closed form. Here are some strategies for getting approximate solutions:
  - maximum a posteriori (MAP) estimation, i.e. finding the most likely parameters
  - variational Bayes, a framework for approximating intractable posterior distributions with tractable ones
  - Markov chain Monte Carlo (MCMC), a set of techniques for approximately sampling from the posterior (a minimal sampler sketch appears after this list)
- While Bayesian parameter estimation attenuates overfitting, it doesn't solve the problem completely; much depends on how the prior is specified. Some common strategies for choosing priors include:
  - conjugate priors, where the prior and posterior have the same functional form
  - uninformative priors, which try to say as little as possible about the parameter
  - Bayesian model averaging, where we average over multiple choices of the prior
  - hierarchical models, where the prior for one problem comes from information obtained from related problems
  - eliciting priors from human experts
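To give some intuition for the MCMC option mentioned above, here is a minimal random-walk Metropolis sketch in Python. Everything in it (the data counts, step size, and number of steps) is made up for illustration, and for this conjugate beta-Bernoulli posterior the exact answer is available in closed form, so the sampler serves only as a sanity check.

import math
import random

def log_post(theta, n1, n0, a=1.0, b=1.0):
    # Unnormalized log posterior of a beta-Bernoulli model: Beta(a, b) prior, n1 heads, n0 tails.
    if theta <= 0.0 or theta >= 1.0:
        return float("-inf")
    return (a + n1 - 1) * math.log(theta) + (b + n0 - 1) * math.log(1.0 - theta)

def metropolis(n1, n0, steps=20000, step_size=0.1):
    theta, samples = 0.5, []
    for _ in range(steps):
        proposal = theta + random.gauss(0.0, step_size)        # symmetric random-walk proposal
        log_accept = log_post(proposal, n1, n0) - log_post(theta, n1, n0)
        if random.random() < math.exp(min(0.0, log_accept)):   # Metropolis acceptance rule
            theta = proposal
        samples.append(theta)
    return samples[steps // 2:]                                # crude burn-in: drop the first half

samples = metropolis(n1=7, n0=3)
print(sum(samples) / len(samples))   # should be near the exact posterior mean 8/12 ~= 0.667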