
I want to compare the distributions of 2 independent datasets. Measurements were performed on an experimental dataset (TEST) and compared with a completely independent reference dataset (REF). The idea is to determine whether the measurements in the experimental dataset follow the same distribution as the reference.

I looked at the Kolmogorov-Smirnov test (two-sided) but I am not sure it does exactly what I think it does.
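For reference, the two-sided two-sample KS statistic is the largest vertical gap between the two empirical CDFs. Below is a minimal sketch of that quantity only, not of the p-value; the function name `ks_2samp_statistic` and the sample values are made up for illustration and are not the real measurements.

```python
from bisect import bisect_right


def ks_2samp_statistic(x, y):
    """Return D = max |F_x(t) - F_y(t)| over all observed values t."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    d = 0.0
    # Both empirical CDFs are step functions, so the largest gap
    # is attained at one of the pooled observed values.
    for t in set(xs) | set(ys):
        fx = bisect_right(xs, t) / n  # fraction of x that is <= t
        fy = bisect_right(ys, t) / m  # fraction of y that is <= t
        d = max(d, abs(fx - fy))
    return d


# Made-up illustrative values, NOT the data from the plot.
ref = [0.016, 0.016, 0.03, 0.06, 0.25, 0.25, 0.5, 32.0]
test = [0.016, 0.08, 0.08, 0.08, 0.25, 0.5, 32.0, 32.0]

d_stat = ks_2samp_statistic(ref, test)
print(f"D = {d_stat:.3f}")
```

In practice `scipy.stats.ks_2samp` computes the same statistic together with a p-value, but that p-value assumes continuous data without ties.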

Can anyone suggest a test more appropriate to test the difference/similarity in distribution of these 2 datasets?

Thanks!

enter image description here

Seb Matamoros
  • What is your objection to Kolmogorov-Smirnov? – Dave Aug 15 '19 at 15:23
  • You could measure the distance between the two distributions using Kullback Leibler divergence to get a better feeling for the dissimilarity and then use the Kolmogorov-Smirnov test just for computing the p-value. – resnet Aug 15 '19 at 15:27
  • Try also the Baumgartner-Weiss-Schindler test, _cf_ https://cran.r-project.org/web/packages/BWStest/index.html – steveo'america Aug 15 '19 at 16:23
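The Kullback-Leibler suggestion in the comments above could be sketched roughly as follows, assuming both datasets are summarised as counts over the same ordered bins. The function name `kl_divergence`, the `pseudocount` smoothing and the counts are all assumptions made for illustration. The smoothing matters because KL(P || Q) is infinite whenever Q has an empty bin where P does not.

```python
import math


def kl_divergence(p_counts, q_counts, pseudocount=0.5):
    """KL(P || Q) in nats from two count vectors over the same bins.

    A pseudocount is added to every bin (an assumption, not part of
    the definition) so that empty bins do not give an infinite result.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("both count vectors must use the same bins")
    p = [c + pseudocount for c in p_counts]
    q = [c + pseudocount for c in q_counts]
    p_total, q_total = sum(p), sum(q)
    return sum(
        (pi / p_total) * math.log((pi / p_total) / (qi / q_total))
        for pi, qi in zip(p, q)
    )


# Made-up counts per concentration bin, NOT the data from the plot.
ref_counts = [120, 40, 0, 300, 80, 10]
test_counts = [90, 35, 380, 250, 60, 0]

print(f"KL(TEST || REF) = {kl_divergence(test_counts, ref_counts):.3f}")
print(f"KL(REF || TEST) = {kl_divergence(ref_counts, test_counts):.3f}")
```

The two directions generally give different numbers, since KL divergence is not symmetric, and the result depends on the chosen pseudocount.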

1 Answer

I still want to hear your reasons for doubting KS as an appropriate method, but now that I've looked at your graph more closely, I say that KS does not apply. Your data are discrete, and KS does not apply to data drawn from discrete distributions. However, you could use a chi-squared test! I wrote about this yesterday. Instead of checking whether frequencies match the frequencies expected from a fair die, you'd be checking whether the TEST frequencies match the REF frequencies.
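As a rough sketch of that comparison, one way to set it up is a chi-squared test of homogeneity on a 2 × k table of counts (REF and TEST as rows, measurement values as columns). The function name `chi2_homogeneity` and the counts are made up for illustration. Only the statistic and degrees of freedom are computed here.

```python
def chi2_homogeneity(ref_counts, test_counts):
    """Pearson chi-squared statistic and degrees of freedom for a
    2 x k table (REF row, TEST row) over the same bins."""
    if len(ref_counts) != len(test_counts):
        raise ValueError("both count vectors must use the same bins")
    # Drop bins that are empty in both samples; their expected count is 0.
    cols = [(r, t) for r, t in zip(ref_counts, test_counts) if r + t > 0]
    ref_total = sum(r for r, _ in cols)
    test_total = sum(t for _, t in cols)
    grand_total = ref_total + test_total
    stat = 0.0
    for r, t in cols:
        col_total = r + t
        # Expected counts if both samples shared one distribution.
        exp_r = ref_total * col_total / grand_total
        exp_t = test_total * col_total / grand_total
        stat += (r - exp_r) ** 2 / exp_r + (t - exp_t) ** 2 / exp_t
    dof = len(cols) - 1  # (rows - 1) * (columns - 1) with 2 rows
    return stat, dof


# Made-up counts per measurement value, NOT the data from the plot.
ref_counts = [120, 40, 0, 300, 80, 10]
test_counts = [90, 35, 380, 250, 60, 0]

stat, dof = chi2_homogeneity(ref_counts, test_counts)
print(f"chi2 = {stat:.1f} on {dof} degrees of freedom")
```

`scipy.stats.chi2_contingency` performs the same calculation and also returns a p-value. The usual caveat applies that the approximation is poor when many expected counts are small.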

Just looking at the graph, the answer is that the distributions are different. There are almost 400 blue observations (seems like it's about 30% of the blue observations) at 0.08, yet red does not get even one instance of 0.08.

For a discussion of KS on discrete distributions: Is Kolmogorov-Smirnov test valid with discrete distributions?

My description of the chi-squared test (turns out to be unrelated to skewness): How to identify if my data set is skewed or not?

Dave
  • Thanks! The data measured is biological, so in theory it is continuous: any sample can have a measurement between 0 and 512 (even theoretically more); but the method used for the measurements only records values based on the exponential function. Additionally we know that, due to biological (genetic) constraints, there are more samples showing values around 0.016, 0.25 and 32. I will use a chi-square, seems more appropriate. – Seb Matamoros Aug 19 '19 at 09:44
  • Do you have ties in your data? The graph makes it look like you do, but your comment makes it sound like you don’t. – Dave Aug 19 '19 at 10:35
  • Yes, the test set has several "0" counts for the low and high values, which means I could not use the KS test on the whole range of measured values. – Seb Matamoros Aug 20 '19 at 09:54
  • Then it sounds like most of your distribution is continuous, so chi-squared would not apply. I’m curious...what process is giving you data, microarray? – Dave Aug 20 '19 at 10:03
  • The data represented is minimum inhibitory concentration (MIC), a measure of bacterial antibiotic susceptibility, obtained using a method called microbroth dilution. The antibiotic is diluted 2X for each measurement point (hence the exponential function). We record how many bacterial strains can resist each concentration. Looking more into the chi-square (I don't use it often), I am not sure it's appropriate after all. – Seb Matamoros Aug 20 '19 at 10:17
  • I agree that $\chi^2$ would not be the right test for you. – Dave Aug 20 '19 at 10:21