We have the following implementation of KLD (Kullback-Leibler divergence):
import numpy as np
import pandas as pd
from scipy.stats import entropy
def KL_divergence(a, b):
    # discretize both datasets into 100 equal-width bins over [0, 1]
    hist_a = np.histogram(a, bins=100, range=(0, 1.0))[0]
    hist_b = np.histogram(b, bins=100, range=(0, 1.0))[0]
    # avoid division by zero in the reference histogram
    hist_b = np.where(hist_b == 0.0, 1e-6, hist_b)
    # scipy's entropy(pk, qk) normalizes the counts and returns KL(pk || qk)
    return entropy(hist_a, hist_b)
It takes two datasets (with values in the range 0-1), discretizes each into 100 equal-width bins, and calculates the KLD between the resulting histograms.
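For example, calling it on two samples drawn from the same distribution looks like this (the Beta distribution and the sample sizes here are arbitrary choices, just for illustration, not my actual data):
rng = np.random.default_rng(0)
small_sample = rng.beta(2, 5, size=1_000)     # small dataset
large_sample = rng.beta(2, 5, size=100_000)   # large dataset, same underlying distribution
# even though both samples come from the same distribution, the estimate is positive,
# and it tends to grow as the smaller sample shrinks
print(KL_divergence(small_sample, large_sample))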
In practice, this does not work at all, because the distance scales strongly with the size of the dataset (smaller dataset = larger distance). Below is a simple script that simulates many pairs of datasets of different sizes (1,000, 10,000, and 100,000 samples), evaluates the KLD for each pair, and plots a histogram of the resulting distances for each size. The "underlying probability" is an example distribution those datasets might follow.
import numpy as np
import pandas as pd
from scipy.stats import entropy
import matplotlib.pyplot as plt
%matplotlib inline
def KL_divergence(hist_a, hist_b):
    return entropy(hist_a, hist_b)

actual_bin_counts = np.array([7805, 436, 396, 456, 559, 809, 1139, 1928, 4618, 60948])
underlying_probability = actual_bin_counts / actual_bin_counts.sum()

def generate_histogram(n_samples, true_probs=underlying_probability):
    # draw n_samples from the categorical distribution defined by true_probs
    uniform_random = np.random.uniform(0, 1, size=n_samples)
    bin_indices = np.digitize(uniform_random, true_probs.cumsum())
    # count how many samples fell into each bin
    return np.unique(bin_indices, return_counts=True)[1]
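# Note: when a rare bin happens to get zero samples, np.unique returns only 9 counts
# and entropy() then fails on the length mismatch; that is what the try/except blocks
# below paper over. A more robust sketch (not what I actually ran) would count with a
# fixed number of bins instead:
#   return np.bincount(bin_indices, minlength=len(true_probs))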
distances_1000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(1000)
        sampled_b = generate_histogram(1000)
        distances_1000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # occasionally a rare bin gets zero samples, so the histogram has only 9 entries;
        # I don't care enough to fix it, so that repeat is just skipped
        pass

distances_10_000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(10_000)
        sampled_b = generate_histogram(10_000)
        distances_10_000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # same as above
        pass

distances_100_000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(100_000)
        sampled_b = generate_histogram(100_000)
        distances_100_000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # same as above
        pass
plt.xscale('log')
plt.hist(distances_1000, bins=100, label='n = 1,000');
plt.hist(distances_10_000, bins=100, label='n = 10,000');
plt.hist(distances_100_000, bins=100, label='n = 100,000');
plt.legend();
As you can see, even though the underlying distribution is identical in every case, the distances for the different sample sizes land on completely different scales and cannot be compared. How do I correct for the size of the datasets?