We have the following implementation of KLD (Kullback-Leibler divergence):
import numpy as np
import pandas as pd
from scipy.stats import entropy
def KL_divergence(a, b):
    # discretize both datasets into 100 equal-width bins over [0, 1]
    hist_a = np.histogram(a, bins=100, range=(0, 1.0))[0]
    hist_b = np.histogram(b, bins=100, range=(0, 1.0))[0]
    # avoid division by zero in the reference histogram
    hist_b = np.where(hist_b == 0.0, 1e-6, hist_b)
    # scipy's entropy(pk, qk) normalizes the counts and returns KL(pk || qk)
    return entropy(hist_a, hist_b)
It takes two datasets (with values in the range 0-1), discretizes each into 100 equal-width bins, and calculates the KLD between the resulting histograms.
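For example, calling it on two samples drawn from the same distribution looks like this (the Beta distribution and the sample sizes here are arbitrary choices, just for illustration, not my actual data):
rng = np.random.default_rng(0)
small_sample = rng.beta(2, 5, size=1_000)     # small dataset
large_sample = rng.beta(2, 5, size=100_000)   # large dataset, same underlying distribution
# even though both samples come from the same distribution, the estimate is positive,
# and it tends to grow as the smaller sample shrinks
print(KL_divergence(small_sample, large_sample))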
In practice, this does not work at all, because the distance scales strongly with the size of the dataset (smaller dataset = larger distance). Below is a simple script that simulates many pairs of datasets of different sizes (1,000, 10,000, and 100,000 samples), evaluates the KLD for each pair, and plots a histogram of the resulting distances for each size. The "underlying probability" is an example distribution those datasets might follow.
import numpy as np
import pandas as pd
from scipy.stats import entropy
import matplotlib.pyplot as plt
%matplotlib inline
def KL_divergence(hist_a, hist_b):
    return entropy(hist_a, hist_b)

actual_bin_counts = np.array([7805, 436, 396, 456, 559, 809, 1139, 1928, 4618, 60948])
underlying_probability = actual_bin_counts / actual_bin_counts.sum()

def generate_histogram(n_samples, true_probs=underlying_probability):
    # draw n_samples from the categorical distribution defined by true_probs
    uniform_random = np.random.uniform(0, 1, size=n_samples)
    bin_indices = np.digitize(uniform_random, true_probs.cumsum())
    # count how many samples fell into each bin
    return np.unique(bin_indices, return_counts=True)[1]
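# Note: when a rare bin happens to get zero samples, np.unique returns only 9 counts
# and entropy() then fails on the length mismatch; that is what the try/except blocks
# below paper over. A more robust sketch (not what I actually ran) would count with a
# fixed number of bins instead:
#   return np.bincount(bin_indices, minlength=len(true_probs))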
distances_1000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(1000)
        sampled_b = generate_histogram(1000)
        distances_1000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # occasionally a rare bin gets zero samples, so the histogram has only 9 entries;
        # I don't care enough to fix it, so that repeat is just skipped
        pass

distances_10_000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(10_000)
        sampled_b = generate_histogram(10_000)
        distances_10_000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # same as above
        pass

distances_100_000 = []
for repeat in range(10_000):
    try:
        sampled_a = generate_histogram(100_000)
        sampled_b = generate_histogram(100_000)
        distances_100_000.append(KL_divergence(sampled_a, sampled_b))
    except ValueError:
        # same as above
        pass
plt.xscale('log')
plt.hist(distances_1000, bins=100, label='n = 1,000');
plt.hist(distances_10_000, bins=100, label='n = 10,000');
plt.hist(distances_100_000, bins=100, label='n = 100,000');
plt.legend();
As you can see, even though the underlying distribution is identical in every case, the distances for the different sample sizes land on completely different scales and cannot be compared. How do I correct for the size of the datasets?