# KL divergence

## Summary

KL divergence, roughly speaking, is a measure of the distance between two probability distributions P and Q, and corresponds to the expected number of extra bits required to encode samples from P using a code optimized for Q. It is not truly a distance function, because it is not symmetric and it doesn't satisfy the triangle inequality. Despite this, it is widely used in information theory and probabilistic inference.
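The extra-bits interpretation can be made concrete with a small numeric sketch. The helper below is illustrative (not from any particular library) and computes the discrete KL divergence in bits, using the usual conventions for zero probabilities:

```python
import math

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) * log2(p(x) / q(x)), in bits.

    p and q are sequences of probabilities over the same finite outcomes.
    Conventions: terms with p(x) = 0 contribute 0; a q(x) = 0 where
    p(x) > 0 makes the divergence infinite.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue  # 0 * log(0/q) is taken to be 0
        if qx == 0:
            return math.inf
        total += px * math.log2(px / qx)
    return total

# A fair coin P, encoded with a code optimized for a biased coin Q:
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # 0.0 -- the right code costs no extra bits
print(kl_divergence(p, q))  # ~0.74 extra bits per symbol on average
```

Using base-2 logarithms gives the answer in bits; natural logarithms (nats) are equally common in the literature.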

## Context

This concept has the prerequisites:

- entropy (KL divergence is closely related to entropy.)
- expectation and variance (KL divergence is defined as an expectation.)

## Goals

- Know the definition of KL divergence.

- Derive some basic properties:
  - that it is nonnegative
  - that the KL divergence between a distribution and itself is 0

- Show that it is not a true distance metric, because:
  - it is not symmetric
  - it doesn't satisfy the triangle inequality
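These properties can all be checked numerically on small discrete distributions. The snippet below is an illustrative sketch (the three Bernoulli distributions are arbitrary choices, not from the source) showing nonnegativity, zero self-divergence, asymmetry, and a violation of the triangle inequality:

```python
import math

def kl(p, q):
    """KL divergence D(P || Q) in bits, for distributions over the same outcomes."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

# Three Bernoulli distributions, written as (p(heads), p(tails)).
P = (0.5, 0.5)
Q = (0.25, 0.75)
R = (0.05, 0.95)

# Nonnegativity, and zero divergence from a distribution to itself:
assert kl(P, P) == 0
assert kl(P, Q) > 0

# Not symmetric: D(P || Q) != D(Q || P).
print(kl(P, Q), kl(Q, P))  # ~0.208 vs ~0.189

# Fails the triangle inequality: D(P || R) > D(P || Q) + D(Q || R).
print(kl(P, R), kl(P, Q) + kl(Q, R))  # ~1.198 vs ~0.532
```

A numerical check is not a proof, of course; the nonnegativity proof goes through Jensen's inequality, as covered in the Cover & Thomas reading below.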

## Core resources (we're sorry, we haven't finished tracking down resources for this concept yet)

## Supplemental resources (the following are optional, but you may find them useful)

## Paid

→ Elements of Information Theory

A graduate-level textbook on information theory.

Location:
- Section 2.3, "Relative entropy and mutual information," pages 19-20
- Section 2.6, "Jensen's inequality and its consequences," up to Theorem 2.6.3, pages 25-28

→ Pattern Recognition and Machine Learning

A textbook for a graduate machine learning course, with a focus on Bayesian methods.

Location:
Section 1.6.1, "Relative entropy and mutual information," pages 55-58

→ Probabilistic Graphical Models: Principles and Techniques

A very comprehensive textbook for a graduate-level course on probabilistic AI.

Location:
Section A.1.3.1, "Relative entropy," page 1139

## See also

- Variational Bayes is a class of approximate inference algorithms which try to minimize the KL divergence between distributions.
- KL divergence is a special case of more general families of divergences, and it is locally approximated by the [Fisher information metric](fisher_information_metric).
- Mutual information can be [defined](mutual_information_and_kl_divergence) in terms of KL divergence.