KL Divergence: The Universal Objective in Modern Machine Learning

The Engineer

3 Jun 2024 · 3 min read

KL divergence sits at the heart of modern machine learning: a single measure that turns probabilistic theory into trainable objectives and offers a common framework for understanding diverse models.

Modern machine learning is a vast landscape of acronyms and initialisms, from VAEs (Variational Autoencoders) to BBB (Bayes by Backprop). But the more I delve into this field, the clearer it becomes that most modern methods boil down to one universal objective: minimizing the Kullback-Leibler (KL) divergence. This concept is not just a theoretical curiosity; it's a practical tool with a simple recipe that can help you derive and understand a wide array of machine learning models.

KL Divergence as Expected Weight of Evidence

To appreciate why KL divergence is so fundamental, let’s start with its interpretation as an expected weight of evidence. Imagine you have two hypotheses, ( P ) and ( Q ), and you want to determine which one better explains your data ( D ). The key quantity here is the odds of ( P ) versus ( Q ) given the data:

[ \frac{\Pr(P|D)}{\Pr(Q|D)} ]

Using Bayes' rule, this can be expressed as:

[ \frac{\Pr(P|D)}{\Pr(Q|D)} = \frac{\Pr(D|P)}{\Pr(D|Q)} \cdot \frac{\Pr(P)}{\Pr(Q)} ]

Here, ( \frac{\Pr(D|P)}{\Pr(D|Q)} ) is the likelihood ratio, and ( \frac{\Pr(P)}{\Pr(Q)} ) is the prior odds. Taking the logarithm of both sides turns the product into a sum:

[ \log \frac{\Pr(P|D)}{\Pr(Q|D)} = \log \frac{\Pr(D|P)}{\Pr(D|Q)} + \log \frac{\Pr(P)}{\Pr(Q)} ]

The term ( \log \frac{\Pr(D|P)}{\Pr(D|Q)} ) is the log-likelihood ratio, which can be interpreted as the weight of evidence provided by the data in favor of hypothesis ( P ) over ( Q ). The KL divergence between ( P ) and ( Q ), denoted ( D_{\text{KL}}(P \| Q) ), is the expected value of this log-likelihood ratio when the data is actually drawn from ( P ):

[ D_{\text{KL}}(P \| Q) = \mathbb{E}_{D \sim P} \left[ \log \frac{\Pr(D|P)}{\Pr(D|Q)} \right] ]

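This interpretation makes KL divergence a natural measure of how much evidence we expect the data to provide for ( P ) over ( Q ) when ( P ) is the true source of the data. It also explains why the divergence is asymmetric: the expectation is taken under ( P ), not ( Q ). That combination is what makes it a cornerstone of probabilistic modeling.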
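To make the "expected weight of evidence" reading concrete, here is a minimal sketch with two hypothetical coins (the 70% and 50% head probabilities are made up for illustration). It draws flips from ( P ), averages the log-likelihood ratio per flip, and compares that average against the closed-form KL divergence between the two coins.

```python
import math
import random

def log_likelihood_ratio(flips, p_heads, q_heads):
    """Total weight of evidence (in nats) that the flips give for P over Q."""
    total = 0.0
    for heads in flips:
        if heads:
            total += math.log(p_heads / q_heads)
        else:
            total += math.log((1 - p_heads) / (1 - q_heads))
    return total

def kl_bernoulli(p_heads, q_heads):
    """Closed-form KL(P || Q) between two coins, in nats per flip."""
    return (p_heads * math.log(p_heads / q_heads)
            + (1 - p_heads) * math.log((1 - p_heads) / (1 - q_heads)))

# Hypothetical hypotheses: P says heads 70% of the time, Q says 50%.
p_heads, q_heads = 0.7, 0.5

rng = random.Random(0)
n_flips = 100_000
flips = [rng.random() < p_heads for _ in range(n_flips)]  # data drawn from P

average_evidence = log_likelihood_ratio(flips, p_heads, q_heads) / n_flips
exact_kl = kl_bernoulli(p_heads, q_heads)
print(f"average weight of evidence per flip: {average_evidence:.4f} nats")
print(f"closed-form KL(P || Q):              {exact_kl:.4f} nats")
```

With enough flips the two numbers should roughly agree, which is the sense in which KL divergence is the evidence you expect to accumulate per observation.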

The Universal Recipe

The beauty of KL divergence is that it can be used to derive many well-known machine learning models. Here’s a simple recipe you can follow:

Define the Model: Start by defining your model ( P ) and a reference distribution ( Q ).
Formulate the Objective: Minimize the KL divergence between ( P ) and ( Q ), keeping in mind that KL divergence is asymmetric, so the order of the arguments is part of the objective:

[ \min_P D_{\text{KL}}(P \| Q) ]

Optimize: Use optimization techniques such as gradient descent to find the parameters of ( P ) that minimize this objective.
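As a toy run of the three steps, the sketch below assumes both ( P ) and ( Q ) are one-dimensional Gaussians, so the KL divergence has a closed form and its gradients can be written by hand. The target values, starting point, step count, and learning rate are all arbitrary choices for illustration; a real model would typically rely on automatic differentiation and stochastic estimates.

```python
import math

def kl_gaussians(mean_p, std_p, mean_q, std_q):
    """Closed-form KL(P || Q) between two univariate Gaussians, in nats."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2)
            - 0.5)

def fit_by_minimizing_kl(mean_q, std_q, steps=2000, learning_rate=0.05):
    """Gradient descent on the mean and log-std of P to minimize KL(P || Q)."""
    mean_p, log_std_p = 0.0, 0.0  # arbitrary starting point
    for _ in range(steps):
        std_p = math.exp(log_std_p)
        # Hand-derived gradients of the closed-form KL above.
        grad_mean = (mean_p - mean_q) / std_q ** 2
        grad_log_std = std_p ** 2 / std_q ** 2 - 1.0
        mean_p -= learning_rate * grad_mean
        log_std_p -= learning_rate * grad_log_std
    return mean_p, math.exp(log_std_p)

# Step 1: define the reference Q (hypothetical target values).
target_mean, target_std = 2.0, 0.5
# Steps 2 and 3: minimize KL(P || Q) over the parameters of P.
fitted_mean, fitted_std = fit_by_minimizing_kl(target_mean, target_std)
final_kl = kl_gaussians(fitted_mean, fitted_std, target_mean, target_std)
print(f"fitted mean={fitted_mean:.4f}, std={fitted_std:.4f}, KL={final_kl:.2e}")
```

Optimizing the log of the standard deviation keeps the standard deviation positive without needing a constraint.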

This recipe is surprisingly versatile. Let’s see how it applies to a few popular models:

Variational Autoencoders (VAEs):
- Model: ( P(x, z) = P(z)P(x|z) ), where ( x ) is the data and ( z ) is the latent variable.
- Reference Distribution: ( Q(z|x) ), an approximate posterior.
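- Objective: Minimize ( \mathbb{E}_{Q(z|x)}[-\log P(x|z)] + D_{\text{KL}}(Q(z|x) \| P(z)) ), the negative evidence lower bound (ELBO). The KL term on its own only pulls the approximate posterior toward the prior; the reconstruction term is what ties the latent variable to the data.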
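For the common case where the approximate posterior is a diagonal Gaussian and the prior is a standard normal, the KL term has a well-known closed form. The sketch below computes it and combines it with a single-sample reconstruction term. The encoder and decoder here (`toy_encode`, `toy_log_likelihood`) are hypothetical stand-ins, not a real network.

```python
import math
import random

def kl_to_standard_normal(mean, log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)) in nats, summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m ** 2 - 1.0 - lv for m, lv in zip(mean, log_var))

def negative_elbo(x, encode, log_likelihood, rng):
    """Single-sample estimate of the negative ELBO for one data point x."""
    mean, log_var = encode(x)
    # Reparameterized sample: z = mean + std * eps, with eps ~ N(0, 1).
    z = [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]
    reconstruction_loss = -log_likelihood(x, z)
    return reconstruction_loss + kl_to_standard_normal(mean, log_var)

# Hypothetical toy encoder and decoder with a one-dimensional latent.
def toy_encode(x):
    return [0.5 * x], [-1.0]  # posterior mean and log-variance

def toy_log_likelihood(x, z):
    # log N(x; 2 * z, 1): a unit-variance Gaussian centered on the decoded value.
    return -0.5 * (x - 2.0 * z[0]) ** 2 - 0.5 * math.log(2.0 * math.pi)

rng = random.Random(0)
loss = negative_elbo(1.5, toy_encode, toy_log_likelihood, rng)
print(f"negative ELBO estimate: {loss:.4f}")
```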
Diffusion Models:
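- Model: A learned reverse (denoising) process ( P_\theta(x_{t-1}|x_t) ), where ( t ) is the diffusion step.
- Reference Distribution: A fixed forward noising process ( Q(x_t|x_{t-1}) ) that starts from the data distribution ( Q(x_0) ).
- Objective: Minimize ( \sum_t \mathbb{E}_Q \left[ D_{\text{KL}}(Q(x_{t-1}|x_t, x_0) \| P_\theta(x_{t-1}|x_t)) \right] ), which matches each learned denoising step to the corresponding posterior of the forward process. This is the form used in standard denoising diffusion models; the boundary terms at the first and last steps are omitted here.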
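In many diffusion setups, both distributions compared at a given step are assumed to be Gaussians with the same fixed variance. Under that assumption the per-step KL term collapses to a scaled squared error between the two means, which is why diffusion training often looks like plain regression. The sketch below checks this numerically; the means and standard deviation are made-up values.

```python
import math

def kl_isotropic_gaussians(mean_a, std_a, mean_b, std_b):
    """KL(N(mean_a, std_a^2 I) || N(mean_b, std_b^2 I)) in nats, scalar stds."""
    dims = len(mean_a)
    squared_distance = sum((a - b) ** 2 for a, b in zip(mean_a, mean_b))
    return (dims * math.log(std_b / std_a)
            + (dims * std_a ** 2 + squared_distance) / (2 * std_b ** 2)
            - 0.5 * dims)

def scaled_squared_error(mean_a, mean_b, std):
    """What the KL reduces to when both Gaussians share the same std."""
    return sum((a - b) ** 2 for a, b in zip(mean_a, mean_b)) / (2 * std ** 2)

# Hypothetical per-step means: the reference mean and a model's prediction.
reference_mean = [0.2, -0.1]
predicted_mean = [0.0, 0.3]
shared_std = 0.1

full_kl = kl_isotropic_gaussians(reference_mean, shared_std, predicted_mean, shared_std)
shortcut = scaled_squared_error(reference_mean, predicted_mean, shared_std)
print(f"full KL: {full_kl:.6f}, scaled squared error: {shortcut:.6f}")
```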
**Bayes by Backprop (