15

I want to estimate the KL divergence between two continuous distributions $f$ and $g$. However, I can't write down the density for either $f$ or $g$. I can sample from both $f$ and $g$ via some method (for example, Markov chain Monte Carlo).

The KL divergence from $f$ to $g$ is defined like this:

$$\operatorname{D_{\mathrm{KL}}}(f \parallel g) = \int_{-\infty}^{\infty} f(x) \log\left(\frac{f(x)}{g(x)}\right) \operatorname{d}x$$

This is the expectation of $\log\left(\frac{f(x)}{g(x)}\right)$ with respect to $f$, so you could imagine some Monte Carlo estimate

$$\frac{1}{N}\sum_{i=1}^N \log\left(\frac{f(x_i)}{g(x_i)}\right)$$

where $i$ indexes $N$ samples that are drawn from $f$ (i.e. $x_i \sim f$ for $i = 1, \ldots, N$).

However, since I don't know $f$ and $g$, I can't even use this Monte Carlo estimate. What is the standard way of estimating the KL in this situation?

EDIT: I do NOT know the unnormalized density of either $f$ or $g$.

Arya McCarthy
frelk
  • Have you considered using the ecdfs? –  May 21 '17 at 20:53
  • This will work, but it can be arbitrarily slow for hard choices of $f$ and $g$ (close distributions, or close tails). If you decide to ignore samples away from the tails, then you might have more luck with upper-bounding the ROC. – Christian Chapman May 24 '17 at 05:39
  • Essentially a duplicate: https://stats.stackexchange.com/questions/211175/kullback-leibler-divergence/248657#248657 – kjetil b halvorsen May 24 '17 at 08:21

4 Answers

10

I assume you can evaluate $f$ and $g$ up to a normalizing constant. Denote $f(x) = f_u(x)/c_f$ and $g(x) = g_u(x)/c_g$.

A consistent estimator is $$ \widehat{D_{KL}}(f || g) = \left[\frac{1}{n} \sum_{j=1}^n \frac{f_u(x_j)}{\pi_f(x_j)}\right]^{-1}\frac{1}{N}\sum_{i=1}^N \left[\log\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] - \log \hat{r}, $$ where $$ \hat{r} = \frac{n^{-1}\sum_{j=1}^n f_u(x_j)/\pi_f(x_j)}{n^{-1}\sum_{j=1}^n g_u(y_j)/\pi_g(y_j)} \tag{1} $$ is an importance sampling estimator for the ratio of normalizing constants $c_f/c_g$. Here you use $\pi_f$ and $\pi_g$ as instrumental densities for $f_u$ and $g_u$ respectively, and $\pi_r$ to target the expectation of the log ratio of the unnormalized densities.

So let $\{x_j\} \sim \pi_f$, $\{y_j\} \sim \pi_g$, and $\{z_i\} \sim \pi_r$. The numerator of (1) converges to $c_f$ and the denominator converges to $c_g$, so the ratio is consistent for $c_f/c_g$ by the continuous mapping theorem, and the log of the ratio is consistent by continuous mapping again.

Regarding the other part of the estimator, $$ \frac{1}{N}\sum_{i=1}^N \left[\log\left(\frac{f_u(z_i)}{g_u(z_i)}\right)\frac{f_u(z_i)}{\pi_r(z_i)}\right] \overset{\text{a.s.}}{\to} E_{\pi_r}\left[ \log\left(\frac{f_u(Z)}{g_u(Z)}\right)\frac{f_u(Z)}{\pi_r(Z)} \right] = c_f\, E_f\left[ \log\frac{f_u(X)}{g_u(X)} \right] $$ by the strong law of large numbers, where $X \sim f$.

My motivation is the following:

\begin{align*} D_{KL}(f || g) &= \int_{-\infty}^{\infty} f(x) \log\left(\frac{f(x)}{g(x)}\right) dx \\ &= \int_{-\infty}^{\infty} f(x)\left\{ \log \left[\frac{f_u(x)}{g_u(x)} \right] + \log \left[\frac{c_g}{c_f} \right]\right\} dx \\ &= E_f\left[\log \frac{f_u(x)}{g_u(x)} \right] + \log \left[\frac{c_g}{c_f} \right] \\ &= c_f^{-1} E_{\pi_r}\left[\log \frac{f_u(x)}{g_u(x)}\frac{f_u(x)}{\pi_r(x)} \right] + \log \left[\frac{c_g}{c_f} \right]. \end{align*} So I just break it up into tractable pieces.
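For concreteness, here is a minimal Python sketch of this estimator, assuming you can evaluate the unnormalized densities $f_u$, $g_u$ and the instrumental densities $\pi_f$, $\pi_g$, $\pi_r$, and draw samples from the latter three. All names are illustrative, and to keep the sanity check simple I use a single wide Gaussian as the instrumental density for all three roles:

import numpy as np
from scipy.stats import norm

def kl_hat(f_u, g_u, pi_f, pi_g, pi_r, sample_pi_f, sample_pi_g, sample_pi_r, n=100000):
    x = sample_pi_f(n)  # {x_j} ~ pi_f, used to estimate c_f
    y = sample_pi_g(n)  # {y_j} ~ pi_g, used to estimate c_g
    z = sample_pi_r(n)  # {z_i} ~ pi_r, used for the main expectation

    c_f_hat = np.mean(f_u(x) / pi_f(x))  # -> c_f
    c_g_hat = np.mean(g_u(y) / pi_g(y))  # -> c_g
    r_hat = c_f_hat / c_g_hat            # -> c_f / c_g, equation (1)

    # -> c_f * E_f[log(f_u / g_u)]
    main = np.mean(np.log(f_u(z) / g_u(z)) * f_u(z) / pi_r(z))
    return main / c_f_hat - np.log(r_hat)

# Sanity check: unnormalized N(0, 1) and N(1, 1) targets, so KL(f || g) = 0.5
f_u = lambda x: np.exp(-0.5 * x**2)
g_u = lambda x: np.exp(-0.5 * (x - 1)**2)
pi = norm(0, 2)  # one wide instrumental density used as pi_f, pi_g and pi_r
print(kl_hat(f_u, g_u, pi.pdf, pi.pdf, pi.pdf, pi.rvs, pi.rvs, pi.rvs))  # approximately 0.5

The sanity check only works because the normalizing constants happen to be known for these Gaussians; the estimator itself never uses them.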

For more ideas on how to simulate the likelihood ratio, I found a paper that has a few: https://projecteuclid.org/download/pdf_1/euclid.aos/1031594732

Taylor
  • (+1) It's worth noting here that importance sampling can have extremely high variance (even infinite variance) if the target distribution has fatter tails than the distribution you're sampling from and/or the number of dimensions is at all large. – David J. Harris May 24 '17 at 18:22
  • @DavidJ.Harris very very true – Taylor May 24 '17 at 19:46
7

Here I assume that you can only sample from the models; an unnormalized density function is not available.

You write that

$$D_{KL}(f || g) = \int_{-\infty}^{\infty} f(x) \log\left(\underbrace{\frac{f(x)}{g(x)}}_{=: r}\right) dx,$$

where I have defined the ratio of probabilities to be $r$. Alex Smola writes, although in a different context, that you can estimate these ratios "easily" by just training a classifier. Let us assume you have obtained a classifier $p(f|x)$, which can tell you the probability that an observation $x$ was generated by $f$. Note that $p(g|x) = 1 - p(f|x)$. Then:

$$r = \frac{p(x|f)}{p(x|g)} \\ = \frac{p(f|x)\, p(x)\, p(g)}{p(g|x)\, p(x)\, p(f)} \\ = \frac{p(f|x)}{p(g|x)},$$

where the first step is due to Bayes' rule and the last follows from the assumption that $p(g) = p(f)$, i.e. the classifier sees equally many samples from $f$ and $g$.

Getting such a classifier can be quite easy for two reasons.

First, you can do stochastic updates. That means that if you are using a gradient-based optimizer, as is typical for logistic regression or neural networks, you can just draw a few samples from each of $f$ and $g$ and make an update.

Second, since you have virtually unlimited data (you can sample from $f$ and $g$ as much as you like), you don't have to worry about overfitting or the like.
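As a rough illustration, here is a minimal sketch of this idea, assuming scikit-learn's logistic regression as the probabilistic classifier. The KL estimate is just the average of $\log p(f|x) - \log p(g|x)$ over the samples from $f$ (in practice you would evaluate on held-out samples rather than the training set):

import numpy as np
from sklearn.linear_model import LogisticRegression

def kl_via_classifier(samples_f, samples_g):
    # Label samples from f as 1 and samples from g as 0, then fit p(f | x)
    X = np.concatenate([samples_f, samples_g]).reshape(-1, 1)
    y = np.concatenate([np.ones(len(samples_f)), np.zeros(len(samples_g))])
    clf = LogisticRegression().fit(X, y)
    # log r(x) = log p(f | x) - log p(g | x), averaged over the samples from f
    log_proba = clf.predict_log_proba(np.asarray(samples_f).reshape(-1, 1))
    return np.mean(log_proba[:, 1] - log_proba[:, 0])

# Sanity check: KL(N(0, 1) || N(1, 1)) = 0.5
rng = np.random.default_rng(0)
print(kl_via_classifier(rng.normal(0, 1, 100000), rng.normal(1, 1, 100000)))  # roughly 0.5

For one-dimensional Gaussians the true log ratio is linear in $x$, so logistic regression is flexible enough; for more complicated distributions you would swap in a more expressive classifier such as a neural network, as described above.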

bayerj
0

Besides the probabilistic classifier method mentioned by @bayerj, you can also use the lower bound of the KL divergence derived in [1-2]:

$$ \mathrm{KL}[f \Vert g] \ge \sup_{T} \left\{ \mathbb{E}_{x\sim f}\left[ T(x) \right] - \mathbb{E}_{x\sim g} \left[ \exp \left( T(x) - 1 \right)\right] \right\}, $$ where $T:\mathcal{X}\to\mathbb{R}$ is an arbitrary function. Under some mild conditions, the bound is tight for: $$T(x) = 1 + \ln \left[ \frac{f(x)}{g(x)} \right]$$

To estimate the KL divergence between $f$ and $g$, we maximize the lower bound with respect to the function $T(x)$, which can be parameterized by a neural network as in [2].
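As a rough sketch of what this maximization looks like, the snippet below restricts $T$ to the linear family $T(x) = ax + b$, which is enough for the Gaussian sanity check since the optimal $T = 1 + \ln(f/g)$ is linear there, and maximizes the empirical bound with scipy. In general $T$ would be a neural network trained by gradient ascent; all names here are illustrative:

import numpy as np
from scipy.optimize import minimize

def neg_bound(params, samples_f, samples_g):
    # Negative of the empirical lower bound E_f[T(x)] - E_g[exp(T(x) - 1)]
    a, b = params
    return -(np.mean(a * samples_f + b) - np.mean(np.exp(a * samples_g + b - 1)))

rng = np.random.default_rng(0)
samples_f = rng.normal(0, 1, 100000)  # f = N(0, 1)
samples_g = rng.normal(1, 1, 100000)  # g = N(1, 1), so KL(f || g) = 0.5

res = minimize(neg_bound, x0=[0.0, 0.0], args=(samples_f, samples_g), method="Nelder-Mead")
print(-res.fun)  # lower bound on the KL divergence, roughly 0.5

Because the objective is only a lower bound, a too-restrictive family for $T$ gives an underestimate of the KL divergence rather than a noisy estimate of it.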

References:

[1] Nguyen, X., Wainwright, M.J. and Jordan, M.I., 2010. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), pp.5847-5861.

[2] Nowozin, S., Cseke, B. and Tomioka, R., 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems (pp. 271-279).

Cuong
0

Here is a Python implementation of the Monte Carlo KL estimate for Gaussian samples, compared against the closed-form calculation:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import norm

def kl_divergence(p, mu1, sigma1, mu2, sigma2):
    # Monte Carlo estimate: average of log f(x_i) - log g(x_i) over samples x_i ~ f
    p = np.asarray(p)
    return np.mean(norm.logpdf(p, mu1, sigma1) - norm.logpdf(p, mu2, sigma2))

def kl_divergence_cf(mu1, sigma1, mu2, sigma2):
    # Closed-form KL divergence between two univariate Gaussians
    return (np.log(sigma2 / sigma1) +
        (np.power(sigma1, 2) + np.power(mu1 - mu2, 2)) / (2 * np.power(sigma2, 2)) - 0.5)

######## Sampling ########

mu1, sigma1 = 2, 1 # mean and standard deviation
p = np.random.normal(mu1, sigma1, 10000)

mu2, sigma2 = 3, 0.5 # mean and standard deviation

q = np.random.normal(mu2, sigma2, 10000)

######## Closed Form - KL Divergence ########

print("Closed form kl_div : " + 
    str(kl_divergence_cf(mu1, sigma1, mu2, sigma2)))

######## MC Sampling - KL Divergence ########

print("Monte Carlo Estimation kl_div : " + 
    str(kl_divergence(p.tolist(), mu1, sigma1, mu2, sigma2)))

######## Plotting ########

count, bins, ignored = plt.hist(p, 30, density = True)
plt.plot(bins, 1/(sigma1 * np.sqrt(2 * np.pi)) *
               np.exp(- (bins - mu1)**2 / (2 * sigma1**2) ),
         linewidth = 2, color = 'r')

count, bins, ignored = plt.hist(q, 30, density=True)
plt.plot(bins, 1/(sigma2 * np.sqrt(2 * np.pi)) *
               np.exp( - (bins - mu2)**2 / (2 * sigma2**2) ),
         linewidth = 2, color = 'r')
plt.show()
Marjolein Fokkema