How to determine the similarity between histograms

Question

A similar question was asked in "How to assess the similarity of two histograms?"

However, I'm not sure if it's the same question as mine, so I'd like to ask again.

I have a dataset that covers a long time span, so it is possible that the data from some years are corrupted. For example, I plotted a histogram for each 5-year period:

Looking at the plots on the right, it is apparent that the distribution in the last one is very different from the ones above it.

But how can I quantify this difference and detect it with code?

I know that in "How to assess the similarity of two histograms?" the answer is a two-sample goodness-of-fit test, but is that appropriate for my case? I have several plots to compare at the same time.

score 2 · Answer 1 · answered May 10 '16 at 07:19

It really depends on the question you want to tackle. If you are concerned with similarity, you may use the cosine similarity: treat each histogram as a vector of bin counts over the same bins, normalize both vectors to unit length, and calculate their scalar product, which gives you a measure of how aligned the histograms are.
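As a minimal sketch of the cosine similarity idea, assuming the raw samples are available and both histograms are built on the same bin edges (the function name, the bin count, and the synthetic data are illustrative assumptions, not part of the original question):

```python
import numpy as np

def histogram_cosine_similarity(a, b, bins=20, value_range=None):
    """Cosine similarity between the histograms of two samples on shared bins."""
    if value_range is None:
        # Use one common range so both histograms share the same bin edges.
        value_range = (min(np.min(a), np.min(b)), max(np.max(a), np.max(b)))
    h_a, _ = np.histogram(a, bins=bins, range=value_range)
    h_b, _ = np.histogram(b, bins=bins, range=value_range)
    # Scalar product of the two count vectors after scaling each to unit length.
    return float(np.dot(h_a, h_b) / (np.linalg.norm(h_a) * np.linalg.norm(h_b)))

# Synthetic stand-ins for the real data (assumption for illustration only).
rng = np.random.default_rng(0)
same_a = rng.normal(0, 1, 1000)
same_b = rng.normal(0, 1, 1000)
shifted = rng.normal(3, 1, 1000)

sim_same = histogram_cosine_similarity(same_a, same_b)
sim_shifted = histogram_cosine_similarity(same_a, shifted)
print(sim_same, sim_shifted)
```

A value close to 1 means the two histograms have nearly the same shape, and a value close to 0 means they barely overlap. Note that the result depends on the choice of bins.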

But if you would rather test whether two histograms are significantly different, you may use the Kolmogorov-Smirnov test, as pointed out in the answer you refer to.

score 2 · Answer 2 · answered May 10 '16 at 07:48

Two ways to do this are through a QQ plot & a Kolmogorov-Smirnov test.

Since you are looking for a programmatic way to determine and quantify the difference between two distributions, I'd recommend a KS test. A KS test is nice in that you can easily get a p-value out of it, and it is very easy to run. Here's some example Python code:

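import numpy as np
from scipy.stats import ks_2samp

# Two samples drawn from the same standard normal distribution
x = np.random.normal(0, 1, 1000)
y = np.random.normal(0, 1, 1000)

# ks_2samp returns the KS statistic and the p-value
statistic, p_value = ks_2samp(x, y)
print(statistic, p_value)
# Example output (varies from run to run, since no seed is set):
# 0.022999999999999909 0.95189016804849658
# p-value = .95 (no significant difference)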

But which one should I use as a standard? There are several plots to compare in this case. Maybe I should make a score matrix? — cqcn1991, May 19 '16 at 08:13
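One way to sketch the score matrix suggested in the comment above is to run the two-sample KS test on every pair of periods and flag the period whose average KS statistic against the others is largest. This is only an illustration under stated assumptions: the helper name `ks_distance_matrix`, the synthetic data, and the "largest mean distance" rule are all hypothetical choices, not something the answers prescribe.

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_distance_matrix(samples):
    """Symmetric matrix of pairwise two-sample KS statistics."""
    n = len(samples)
    stats = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            stat = ks_2samp(samples[i], samples[j]).statistic
            stats[i, j] = stats[j, i] = stat
    return stats

# Synthetic stand-ins for the 5-year periods (assumption for illustration only).
rng = np.random.default_rng(0)
periods = [rng.normal(0, 1, 500) for _ in range(4)]
periods.append(rng.normal(2, 1, 500))  # stand-in for a corrupted period

matrix = ks_distance_matrix(periods)
# Average distance of each period to all the others (the diagonal is zero).
mean_distance = matrix.sum(axis=1) / (len(periods) - 1)
most_different = int(np.argmax(mean_distance))
print(np.round(matrix, 2))
print("Most different period index:", most_different)
```

With this approach no single period has to be picked as the standard. A period that is far from all the others stands out as a row of large values. If p-values are used instead of the statistic, keep in mind that many pairwise tests call for a multiple-comparison correction.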
