What Is KL Divergence? Probability Distributions Explained

KL divergence (Kullback-Leibler divergence) is a measure of how different one probability distribution is from another. It answers a specific question: if you have a “true” distribution of data and an approximation of it, how much information do you lose by using the approximation instead? The result is always zero or greater, and zero means the two distributions are identical.

Also called “relative entropy,” KL divergence shows up constantly in machine learning, statistics, and information theory. If you’re training a neural network, compressing data, or comparing statistical models, KL divergence is likely working somewhere under the hood.

The Core Idea Behind KL Divergence

Imagine you have two descriptions of reality. One is the true distribution of some data, call it P. The other is your model’s best guess at that distribution, call it Q. KL divergence measures the extra “surprise” or wasted information you’d experience if you relied on Q when the world actually follows P.

Think of it in terms of encoding messages. If you knew the true distribution P, you could design a perfectly efficient code to transmit data with the fewest possible bits. But if you mistakenly design your code based on Q instead, you’ll use more bits than necessary. KL divergence tells you exactly how many extra bits per message that mistake costs you.

This is why the value can never be negative. You can never do better than the true distribution. If your approximation Q happens to perfectly match P, the extra cost is zero. Any difference at all pushes the value above zero, and the worse your approximation, the higher the divergence climbs.

How It Relates to Entropy and Cross-Entropy

KL divergence fits neatly into a family of information-theoretic quantities. Shannon entropy, H(P), measures the inherent uncertainty in a distribution. It’s the minimum average number of bits needed to encode outcomes drawn from P. Cross-entropy, H(P, Q), measures the average bits needed when you encode outcomes from P using a code optimized for Q.

The relationship is straightforward: KL divergence equals cross-entropy minus entropy.

KL(P || Q) = H(P, Q) - H(P)

Cross-entropy is always at least as large as entropy (since using the wrong code can’t be more efficient than using the right one). The gap between them is the KL divergence: the penalty for using the wrong distribution. This is why minimizing cross-entropy in a machine learning model is equivalent to minimizing KL divergence. The entropy of the true data is fixed, so reducing cross-entropy directly reduces the divergence between your model and reality.
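The identity is easy to check numerically. Here is a minimal sketch in plain Python (the distributions and helper names are made up for illustration), computing all three quantities in bits for a small discrete distribution:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in bits: encoding P-distributed data with a Q-optimal code."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL(P || Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # the "true" distribution
q = [0.4, 0.4, 0.2]     # an approximation of it

# KL(P || Q) = H(P, Q) - H(P): the gap between cross-entropy and entropy.
assert abs(kl(p, q) - (cross_entropy(p, q) - entropy(p))) < 1e-12
# The wrong code always costs at least as much as the right one.
assert cross_entropy(p, q) >= entropy(p)
```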

The Formula for Discrete and Continuous Cases

For discrete random variables where P and Q assign probabilities to a set of outcomes, KL divergence is calculated as:

KL(P || Q) = Σ p(x) · ln( p(x) / q(x) )

You loop through every possible outcome x, take the ratio of the two probabilities, apply a logarithm, and weight it by p(x). The log can use any base (natural log, log base 2, etc.), which just changes the units. Base 2 gives you the result in bits; natural log gives it in “nats.”
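The formula translates directly into a loop. A sketch (example distributions are arbitrary) that also shows how the choice of log base only changes the units:

```python
import math

def kl_divergence(p, q, base=math.e):
    """KL(P || Q) = sum over x of p(x) * log(p(x)/q(x)).

    base=2 gives the result in bits; base=e (the default) gives nats.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:                       # terms with p(x) = 0 contribute nothing
            total += px * math.log(px / qx, base)
    return total

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]

nats = kl_divergence(p, q)               # natural log
bits = kl_divergence(p, q, base=2)       # log base 2

assert abs(bits - nats / math.log(2)) < 1e-12   # units differ by a factor of ln 2
assert nats >= 0                                # the divergence is never negative
```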

For continuous distributions, the sum becomes an integral:

KL(P || Q) = ∫ p(x) · ln( p(x) / q(x) ) dx

Some common cases have clean closed-form solutions. For two bell curves (Gaussian distributions) that share the same spread σ but have different centers μ₁ and μ₂, the KL divergence simplifies to (μ₁ − μ₂)² / (2σ²): just the squared difference between the two means, divided by twice the variance. The farther apart the peaks, the greater the divergence, scaled by how wide the distributions are.
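The equal-variance Gaussian case can be checked against the integral directly. A sketch (parameter values chosen arbitrarily) comparing the closed form (μ₁ − μ₂)² / (2σ²) to a brute-force Riemann sum of the continuous formula:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

mu_p, mu_q, sigma = 0.0, 1.0, 0.5

# Closed form for two Gaussians with equal spread: (mu_p - mu_q)^2 / (2 sigma^2).
closed_form = (mu_p - mu_q) ** 2 / (2 * sigma ** 2)

# Numerical check: integrate p(x) * ln(p(x)/q(x)) over a wide grid.
dx = 1e-3
numeric = sum(
    gaussian_pdf(x, mu_p, sigma)
    * math.log(gaussian_pdf(x, mu_p, sigma) / gaussian_pdf(x, mu_q, sigma))
    * dx
    for x in (-10 + i * dx for i in range(int(20 / dx)))
)

assert abs(numeric - closed_form) < 1e-6
```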

Why It’s Not a True Distance

One of the most important things to understand about KL divergence is that it is not a distance metric, even though people sometimes loosely call it one. It fails two key requirements that any true distance measure must satisfy.

First, it’s asymmetric. KL(P || Q) does not equal KL(Q || P). Measuring how much Q diverges from P gives a different answer than measuring how much P diverges from Q. Intuitively, using a narrow approximation when reality is broad (missing the tails) is a very different kind of mistake than using a broad approximation when reality is narrow (wasting probability on things that rarely happen).

Second, it doesn’t obey the triangle inequality, a property that says going from A to C directly should never cost more than going from A to B to C. KL divergence makes no such guarantee.

It can also blow up to infinity. If there’s any outcome where Q assigns zero probability but P assigns positive probability, the logarithm in the formula diverges. This makes practical computation tricky when your model assigns zero likelihood to events that actually occur.
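Both failure modes are easy to see in code. A sketch with two toy distributions (the numbers are arbitrary) showing the asymmetry and the blow-up to infinity:

```python
import math

def kl(p, q):
    """KL(P || Q) in nats; returns infinity if q(x) = 0 where p(x) > 0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf          # log(px / 0) diverges
            total += px * math.log(px / qx)
    return total

p = [0.9, 0.1]
q = [0.5, 0.5]

# Asymmetry: the two directions give different numbers.
assert abs(kl(p, q) - kl(q, p)) > 1e-6

# Blow-up: r assigns zero probability to an outcome p considers possible.
r = [1.0, 0.0]
assert kl(p, r) == math.inf    # p has mass where r has none
assert kl(r, p) < math.inf     # the reverse direction stays finite
```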

Jensen-Shannon Divergence as an Alternative

Because of KL divergence’s asymmetry and potential to explode to infinity, researchers developed the Jensen-Shannon divergence (JSD) as a more well-behaved alternative. JSD averages the KL divergence in both directions, using the midpoint of the two distributions as a reference:

JSD(P, Q) = ½ · KL(P || M) + ½ · KL(Q || M), where M = (P + Q) / 2

This fixes both problems at once. JSD is perfectly symmetric, so the order doesn’t matter. It’s also always bounded between zero and log 2, so it never runs off to infinity. And unlike KL divergence, it works even when the two distributions don’t share the same support (meaning one assigns probability to outcomes the other doesn’t). These properties made JSD particularly useful in training generative adversarial networks, where comparing distributions with mismatched supports is common.
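Building on a plain KL implementation, JSD is a few extra lines. A sketch using two distributions with completely disjoint support, exactly the case where raw KL divergence would be infinite:

```python
import math

def kl(p, q):
    """KL(P || Q) in bits."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits: average KL to the midpoint M = (P + Q)/2."""
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [1.0, 0.0]
q = [0.0, 1.0]   # disjoint support: plain KL(P || Q) would be infinite

assert jsd(p, q) == jsd(q, p)           # symmetric: order doesn't matter
assert abs(jsd(p, q) - 1.0) < 1e-12     # bounded: at most 1 bit (log2 of 2), even here
```

The midpoint M always covers every outcome that either P or Q covers, which is why the zero-probability blow-up can never happen.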

KL Divergence in Machine Learning

KL divergence plays a central role in training variational autoencoders (VAEs), one of the most widely used generative models. A VAE learns to compress data into a compact set of latent variables and then reconstruct it. The training objective, called the evidence lower bound (ELBO), has two competing terms: one that rewards accurate reconstruction and one that penalizes the latent variables for straying too far from a simple prior distribution (typically a standard bell curve with mean zero and unit variance). That second term is a KL divergence.

This KL penalty serves as regularization. Without it, the model could learn a chaotic, fragmented internal representation that memorizes training data but generates garbage for new inputs. The KL term forces the latent space to stay organized and smooth, which is what allows you to sample new points from it and get coherent outputs. The balance between reconstruction quality and this KL regularization term is one of the core tensions in VAE design.
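For a diagonal Gaussian posterior against a standard normal prior, this penalty has a well-known closed form, 0.5 · (μ² + σ² − ln σ² − 1) per latent dimension. A minimal sketch of that term (function and variable names are made up; real VAE code would compute this on tensors, and typically parameterizes the log-variance for numerical stability):

```python
import math

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    Closed form per dimension: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1),
    written here in terms of log_var = log(sigma^2).
    """
    return sum(
        0.5 * (m ** 2 + math.exp(lv) - lv - 1)
        for m, lv in zip(mu, log_var)
    )

# A latent code that already matches the prior incurs zero penalty...
assert kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]) == 0.0
# ...and the penalty grows as the encoder's output drifts from N(0, 1).
assert kl_to_standard_normal([2.0, 0.0], [0.0, 0.0]) > 0.0
```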

Beyond VAEs, KL divergence appears whenever you need to compare probability distributions. Language models use cross-entropy loss, which as noted above is equivalent to minimizing KL divergence against the true distribution of text. Knowledge distillation, where a smaller model learns to mimic a larger one, often uses KL divergence to match the smaller model’s output probabilities to the larger model’s. Reinforcement learning from human feedback (RLHF) uses a KL penalty to keep a fine-tuned model from drifting too far from its base version.

KL Divergence as Information Gain

In Bayesian statistics, KL divergence has an elegant interpretation: it measures how much you learned from data. Before seeing data, you have a prior distribution representing your initial beliefs about some parameter. After observing data, you update to a posterior distribution. The KL divergence of the posterior from the prior, KL(posterior || prior), quantifies exactly how much information the data contributed, measured in bits or nats.

This interpretation is practically useful. Researchers have used KL-based rankings to prioritize which model parameters gained the most information from experimental data, letting them focus analysis on the parameters where the data was most informative rather than searching through all of them at random. A high KL divergence between prior and posterior signals that the data substantially changed your beliefs, while a low value means the data told you little you didn’t already know.
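A toy sketch makes the idea concrete. Here a coin's heads-probability is restricted to a small hypothetical grid, the prior is uniform, and Bayes' rule produces a posterior after some imagined flips; the KL divergence between the two is the information gained (all numbers are invented for illustration):

```python
import math

# Grid of candidate values for the coin's heads-probability theta.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5                    # uniform prior: no initial preference

# Suppose we observe 8 heads in 10 flips: likelihood theta^8 * (1 - theta)^2.
likelihood = [t ** 8 * (1 - t) ** 2 for t in thetas]
evidence = sum(l * p for l, p in zip(likelihood, prior))
posterior = [l * p / evidence for l, p in zip(likelihood, prior)]

# Information gain: KL(posterior || prior), in bits.
gain = sum(po * math.log2(po / pr) for po, pr in zip(posterior, prior) if po > 0)

assert gain > 0   # the data shifted belief (toward high theta), so we learned something
```

If the flips had instead matched the prior's expectations exactly, the posterior would stay close to the prior and the gain would be near zero, which is the "the data told you little" case described above.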