5

I have distributions from two different data sets and I would like to measure how similar their distributions (in terms of their bin frequencies) are. In other words, I am not interested in the correlation of data-point sequences but rather in their distributional properties with respect to similarity. Currently I can only observe similarity by eye-balling, which is not enough. I don't want to assume causality and I don't want to predict at this point, so I assume that correlation is the way to go.

Spearman's correlation coefficient is used to compare non-normal data, and since I don't know anything about the real underlying distribution of my data, I think it would be a safe bet. I wonder if this measure can also be used to compare distributional data rather than the data points that are summarized in a distribution. Here is example code in R that exemplifies what I would like to check:

aNorm <- rnorm(1000000)
bNorm <- rnorm(1000000)
cUni <- runif(1000000)
ha <- hist(aNorm)
hb <- hist(bNorm)
hc <- hist(cUni)
print(ha$counts)
print(hb$counts)
print(hc$counts)
# relatively similar
n <- min(c(NROW(ha$counts),NROW(hb$counts)))
cor.test(ha$counts[1:n], hb$counts[1:n], method="spearman")
# quite different
n <- min(c(NROW(ha$counts),NROW(hc$counts)))
cor.test(ha$counts[1:n], hc$counts[1:n], method="spearman")

Does this make sense or am I violating some assumptions of the coefficient?

Thanks, R.

Ampleforth
  • 413
  • 4
  • 11
  • Spearman's Correlation Coefficient is used to compare relative rank orders. It strikes me that when comparing normal distributions in this way you are particularly unlikely to detect differences in kurtosis. – russellpierce Jul 29 '10 at 21:53
  • Not to mention that this code is total junk; you must normalize the histograms somehow, and cutting off the end is not a good idea (still, I have no idea how to do it). And format your code as code (indent with 4 spaces). –  Jul 29 '10 at 22:13
  • Indent your code with 4 spaces to get it converted into code. Or, equivalently, select it and use the button with binary data. –  Sep 25 '10 at 17:05
  • mdb: Not sure what you mean. I would not ask if it were all correct and without question. I am not sure what you mean by indenting. I hope it has nothing to do with using this website 'right' ;) – Ampleforth Sep 25 '10 at 19:22
  • I think it's just to remind you that this is a Markdown enabled website, so you can benefit from easy syntax highlighting (which facilitates reading) through md markup (http://j.mp/9bQHMC) -- or the utilities provided in the on-line editor. – chl Sep 25 '10 at 21:38
  • 1
    yes, Spearman's rho does not apply to this setting, only to paired observations. I cannot but notice this coincides with a book review I just wrote where the author uses Spearman's rho in the same inappropriate way. – Xi'an Jan 18 '12 at 12:58

3 Answers

10

Use the Kolmogorov–Smirnov test instead; it is exactly what you need. The R function ks.test implements it.
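A minimal sketch of the idea (sample sizes and seed are arbitrary): the two-sample KS test compares the raw samples directly, with no binning needed, and its D statistic can serve as a distance-like measure of dissimilarity rather than just a yes/no test.

```r
set.seed(1)
a <- rnorm(1000)
b <- rnorm(1000)
u <- runif(1000)

# Two-sample KS test works on the raw samples, no histogram required
res_ab <- ks.test(a, b)   # both N(0, 1): no real difference to detect
res_au <- ks.test(a, u)   # normal vs. uniform: expect a tiny p-value

# The D statistic itself quantifies the maximum distance between the
# two empirical CDFs, so it can be compared across pairs of samples
res_ab$statistic
res_au$statistic
```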

Also check this question.

  • 1
    The KS test always comes out significant with very large sets (I have a dataset larger than 1 million data points). Also, I would like to have a quantifying measure that tells me about goodness of fit rather than just a test. Am I asking for too much? Thanks in advance, A – Ampleforth Jul 29 '10 at 21:21
  • I agree with the question answerer. What you are looking for is exactly a KS test. The heightened ability of a KS test to detect violations of normality in large datasets is not a reason to toss it aside. But, you may want to set different thresholds in terms of the D statistic depending on your sample size or get into the guts of the equation and see if you can remove the increased likelihood of rejecting the null as a function of sample size. – russellpierce Jul 29 '10 at 21:50
  • And as drknexus wrote, it is easy to extract the statistic and use it for comparison. Even the p-values will do. –  Jul 29 '10 at 22:15
  • Is it actually valid to use p-values as a qualitative measure if they are not statistically significant (i.e. below .05)? Wouldn't it be much better to have some kind of effect size? – Ampleforth Jul 29 '10 at 23:25
  • 5
    Why below 0.05, not 0.04973? Statistical significance does not have any in-depth meaning, it is just an accepted probability of analysis failure. The operation of transforming statistic into a p-value is monotonic, so there is no problem with comparison. Still obtaining a significance level of this comparison is problematic (I have no better idea than bootstrap). –  Jul 30 '10 at 00:20
  • Hi drknexus, you said "set different thresholds in terms of the D statistic depending on your sample size". How can this be achieved with kstest. I did not find a way to manipulate thresholds. I assume that with "getting into the guts of the equation" you mean transformations before testing. My distribution looks multi-modal and I have already an idea what these multi-modal tendencies might be i.e. how I can single them out... – Ampleforth Jul 30 '10 at 00:42
7

For measuring the bin frequencies of two distributions, a pretty good test is the chi-square test. It is exactly what it is designed for. And it is even nonparametric: the distributions don't have to be normal or symmetric. It is much better than the Kolmogorov–Smirnov test, which is known to be weak in the tails of the distribution, where fitting or diagnosing is often most important.

Spearman's correlation won't be as precise in terms of capturing the similarities of your actual bin frequencies. It will only tell you that the overall rankings of observations for the two distributions are similar. Instead, when calculating the chi-square test (long hand, so to speak) you can readily observe which bin-frequency differentials are most responsible for driving down the overall p-value of the chi-square test.
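A rough sketch of this approach (sample sizes, seed, and break points are arbitrary choices): bin both samples with a shared set of breaks so the counts are comparable, run chisq.test on the resulting 2 x k table, and inspect the per-cell residuals to see which bins drive the statistic.

```r
set.seed(2)
a <- rnorm(10000)
u <- runif(10000)

# Shared breaks so the bin counts are directly comparable; the
# open-ended outer bins catch everything outside the central range
breaks <- c(-Inf, seq(-2, 2, by = 0.5), Inf)
counts_a <- table(cut(a, breaks))
counts_u <- table(cut(u, breaks))

# Treat the two count vectors as a 2 x k contingency table
tab <- rbind(counts_a, counts_u)
res <- chisq.test(tab)
res$p.value                 # tiny: the binned shapes clearly differ

# Per-cell Pearson residuals show which bins contribute most
round(res$residuals, 1)
```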

Another pretty good option is the Anderson-Darling test, one of the best tests for diagnosing the fit between two distributions. However, in terms of giving information about the specific bin frequencies, I suspect the chi-square test tells you more.

Sympa
  • 6,862
  • 3
  • 30
  • 56
  • Thank you for your input. I will read more about the Anderson-Darling test, which I have not heard of. Regarding the chi-square test, doesn't this test require that the distribution is chi-square? I agree that all the other assumptions are very relaxed... – Ampleforth Sep 25 '10 at 19:28
  • Ampleforth, I am not sure that is the case. I think you can test all sorts of data with the Chi Square test. – Sympa Sep 26 '10 at 04:18
5

The Baumgartner-Weiss-Schindler statistic is a modern alternative to the K-S test, and appears to be more powerful in certain situations. A few links:

edit: in the years since I posted this answer, I have implemented the BWS test in R in the BWStest package. Use is as simple as:

require(BWStest)
set.seed(12345)
# under the null: both samples drawn from the same N(0, 1) distribution
x <- rnorm(200)
y <- rnorm(200)
# bws_test returns an htest object with the B statistic and p-value
hval <- bws_test(x, y)
shabbychef
  • 10,388
  • 7
  • 50
  • 93