KL divergence
Summary
KL divergence, roughly speaking, is a measure of the distance between two probability distributions P and Q, and corresponds to the expected number of extra bits required to encode samples from P using a code optimized for Q rather than for P. It is not truly a distance function, because it's not symmetric and it doesn't satisfy the triangle inequality. Despite this, it's widely used in information theory and probabilistic inference.
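For reference, the standard definition for discrete distributions is given below (the notation is the conventional one, not quoted from any particular resource listed here; for continuous distributions the sum becomes an integral):

D_{\mathrm{KL}}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)} = \mathbb{E}_{x \sim P}\!\left[ \log \frac{P(x)}{Q(x)} \right]

With logarithms in base 2 the divergence is measured in bits; with natural logarithms it is measured in nats.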
Context
This concept has the prerequisites:
- entropy (KL divergence is closely related to entropy.)
- expectation and variance (KL divergence is defined as an expectation.)
Goals
- Know the definition of KL divergence.
- Derive some basic properties:
  - that it is nonnegative
  - that the KL divergence between a distribution and itself is 0
- Show that it is not a true distance metric (see the numerical sketch after this list), because:
  - it is not symmetric
  - it doesn't satisfy the triangle inequality
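The following minimal Python sketch illustrates these properties for Bernoulli distributions (kl_bernoulli is a hypothetical helper written for this example, not taken from any of the resources below):

import math

def kl_bernoulli(p, q):
    """Compute D(P || Q) in nats, where P = Bernoulli(p) and Q = Bernoulli(q)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# The divergence between a distribution and itself is 0, and it is nonnegative in general.
print(kl_bernoulli(0.3, 0.3))   # 0.0
print(kl_bernoulli(0.3, 0.7))   # ~0.34 (> 0)

# Not symmetric: D(P || Q) differs from D(Q || P).
print(kl_bernoulli(0.1, 0.5), kl_bernoulli(0.5, 0.1))   # ~0.37 vs ~0.51

# Triangle inequality fails: with P = Bern(0.5), Q = Bern(0.4), R = Bern(0.1),
# D(P || R) exceeds D(P || Q) + D(Q || R).
print(kl_bernoulli(0.5, 0.1), kl_bernoulli(0.5, 0.4) + kl_bernoulli(0.4, 0.1))   # ~0.51 vs ~0.33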
Core resources (we're sorry, we haven't finished tracking down resources for this concept yet)
Supplemental resources (the following are optional, but you may find them useful)
-Paid-
→ Elements of Information Theory
A graduate-level textbook on information theory.
Location:
- Section 2.3, "Relative entropy and mutual information," pages 19-20
- Section 2.6, "Jensen's inequality and its consequences," up to Theorem 2.6.3, pages 25-28
→ Pattern Recognition and Machine Learning
A textbook for a graduate machine learning course, with a focus on Bayesian methods.
Location:
- Section 1.6.1, "Relative entropy and mutual information," pages 55-58
→ Probabilistic Graphical Models: Principles and Techniques
A very comprehensive textbook for a graduate-level course on probabilistic AI.
Location:
- Section A.1.3.1, "Relative entropy," page 1139
See also
- Variational Bayes is a class of approximate inference algorithms that try to minimize the KL divergence between a tractable approximating distribution and the true posterior.
- KL divergence is a special case of more general families of divergences, and it is locally approximated by the [Fisher information metric](fisher_information_metric).
- Mutual information can be [defined](mutual_information_and_kl_divergence) in terms of KL divergence.
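For convenience, the definition referenced in the last item is the standard identity (stated here in the usual notation):

I(X; Y) = D_{\mathrm{KL}}\big(P_{X,Y} \,\|\, P_X \, P_Y\big) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}

That is, the mutual information between X and Y is the KL divergence between their joint distribution and the product of their marginals.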