I am trying to find out whether there is a way to quantify how close two discrete histograms representing (normalized) probability distributions are.
To give an example, I generate two lists of integers in the range of $1$ to $10,$ of different sizes $n_1$ and $n_2.$ With the first two measures I have tried, namely KolmogorovSmirnovTest and PearsonChiSquareTest, the results are inconsistent: they are well below $1$ even for two nearly identical distributions (uniform examples), especially as soon as the sample sizes are very different. Here's my example in Mathematica:
n1 = 90000;
n2 = 20000;
SeedRandom[100];
ls1 = RandomInteger[{1, 10}, n1];
SeedRandom[101];
ls2 = RandomInteger[{1, 10}, n2];
hist1 = Histogram[ls1, {1}, "Probability",
   AxesLabel -> {"value", "probability"},
   ChartStyle -> {Yellow},
   ChartLegends -> {"List 1"}];
hist2 = Histogram[ls2, {1}, "Probability",
   AxesLabel -> {"value", "probability"},
   ChartStyle -> {Directive[Red, Opacity[0.5]]},
   ChartLegends -> {"List 2"}];
a) Sample sizes n1 = 90000 and n2 = 20000, very different:
KolmogorovSmirnovTest[ls1, ls2]
PearsonChiSquareTest[ls1, ls2]
> 0.603708
> 0.389257
b) Sample sizes n1 = 30000 and n2 = 20000, more comparable:
KolmogorovSmirnovTest[ls1, ls2]
PearsonChiSquareTest[ls1, ls2]
> 0.999966
> 0.993693
which are closer to the (intuitively) expected values. Case a), however, leads to much lower estimates, well below $1$.
However, when visually comparing the histograms, they are nearly perfectly overlapping and almost equally uniform in both cases.
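The overlaid comparison I am looking at can be produced, for example, by combining the two plots defined above:

Show[hist1, hist2]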
Applying these measures to hist1 and hist2 directly is not possible, but trying the following (i.e. applying the measures to the probabilities of each bin):
KolmogorovSmirnovTest[HistogramList[ls1, {1}, "Probability"][[2]],
  HistogramList[ls2, {1}, "Probability"][[2]]]
PearsonChiSquareTest[HistogramList[ls1, {1}, "Probability"][[2]],
  HistogramList[ls2, {1}, "Probability"][[2]]]
> 0.417524
> 0.223235
leads to similar results, that is, well below $1.$
Since trying the above, I have learned that KolmogorovSmirnovTest is not valid for discrete distributions. Are there measures for quantifying how close two discrete probability histograms are? Would, for instance, the percentage of overlap between them (assuming the same binning) qualify as a meaningful measure?

The data I am trying to examine is very much like the above example, in that the sample sizes of the two lists are very different. Is there a way the similarity of the histograms can be compared irrespective of the sample sizes?
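To make the overlap idea concrete, this is roughly what I have in mind (a minimal sketch; p1, p2 and overlap are just illustrative names, and the bins are fixed to the integer range 1 to 10 so both lists are binned identically):

p1 = HistogramList[ls1, {1, 11, 1}, "Probability"][[2]];
p2 = HistogramList[ls2, {1, 11, 1}, "Probability"][[2]];
(* bin-wise overlap of the two normalized histograms; 1 would mean they coincide *)
overlap = Total[MapThread[Min, {p1, p2}]]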
An attempt to explain the intent better:
One of the histograms is given as input, and I am trying to create a dataset (a list of bounded integers here) whose histogram behaves like the prescribed one. In the ideal case, if the data histograms (when plotted like the above) are not discernible (good convergence), then I can say my data is characterised by the given histogram. But to get there, at each step I want to know whether I can quantify how far I still am from the target distribution. In the example above, for illustration, I create both histograms myself, but one can imagine that e.g. the second one is given.
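For illustration, this is roughly the kind of per-step check I have in mind, assuming the prescribed histogram is given as a list of bin probabilities (targetP, candidate and candP are just placeholder names, and the uniform target is only an example):

(* prescribed bin probabilities on the values 1..10 (here uniform, for illustration) *)
targetP = ConstantArray[0.1, 10];
(* candidate dataset at some step of the construction *)
candidate = RandomInteger[{1, 10}, 5000];
candP = HistogramList[candidate, {1, 11, 1}, "Probability"][[2]];
(* overlap with the target; values close to 1 would indicate good convergence *)
Total[MapThread[Min, {candP, targetP}]]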