
I have several hundred measurements. Now, I am considering using some kind of software to correlate every measure with every other measure. This means there are thousands of correlations. Among these, some high correlations are bound to appear by chance, even if the data are completely random (each measure has only about 100 datapoints).

When I find a correlation, how do I include the information about how hard I looked for it?

I am not at a high level in statistics, so please bear with me.

– David

  • This is a great example of why one needs multiple hypothesis testing. – Dec 26 '10 at 13:44
  • Presumably one can use the permutation procedure to generate a null distribution for significance thresholds for the largest correlation, a different threshold for the second-largest correlation, and so on. Hopefully this would only take a few hours in Python or R. (Ha! Famous last words.) But surely someone must already have done this and saved the code somewhere? – tmo Nov 19 '12 at 00:40
  • @tmo `R` on this machine takes 18 seconds to obtain 1000 realizations of the null permutation distribution of the max correlation coefficient for a 300 by 100 matrix `x`: `correl …` – whuber Nov 19 '12 at 17:45
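
The code in whuber's comment is truncated above; the following is a minimal sketch of how such a null distribution for the maximum correlation might be computed in R. The function name `max.cor`, the simulated data, and all other details are illustrative, not whuber's original code.

```r
# Sketch: null permutation distribution of the largest absolute
# off-diagonal correlation among 100 variables with 300 observations.
set.seed(17)
x <- matrix(rnorm(300 * 100), nrow = 300)   # stand-in data matrix

max.cor <- function(x) {
  x <- apply(x, 2, sample)                  # permute each column independently
  r <- cor(x)
  max(abs(r[lower.tri(r)]))                 # largest off-diagonal |r|
}

null.max <- replicate(1000, max.cor(x))     # 1000 permutation realizations
quantile(null.max, 0.95)                    # a 5% critical value for the maximum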

4 Answers

Answer (score 75)

This is an excellent question, worthy of someone who is a clear statistical thinker, because it recognizes a subtle but important aspect of multiple testing.

There are standard methods to adjust the p-values of multiple correlation coefficients (or, equivalently, to broaden their confidence intervals), such as the Bonferroni and Sidak methods (q.v.). However, these are far too conservative with large correlation matrices, due to the inherent mathematical relationships that must hold among correlation coefficients in general. (For some examples of such relationships see the recent question and the ensuing thread.)

One of the best approaches for dealing with this situation is to conduct a permutation (or resampling) test. It's easy to do this with correlations: in each iteration of the test, just randomly scramble the order of values of each of the fields (thereby destroying any inherent correlation) and recompute the full correlation matrix. Do this for several thousand iterations (or more), then summarize the distributions of the entries of the correlation matrix by, for instance, giving their 97.5 and 2.5 percentiles: these would serve as mutual symmetric two-sided 95% confidence intervals under the null hypothesis of no correlation. (The first time you do this with a large number of variables you will be astonished at how high some of the correlation coefficients can be even when there is no inherent correlation.)
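
A minimal sketch of this procedure in R, using simulated stand-in data; `dat`, the dimensions, and the iteration count are all illustrative:

```r
# Sketch: permutation null distribution for every entry of the
# correlation matrix of an n x p data matrix `dat`.
set.seed(1)
dat <- matrix(rnorm(100 * 20), nrow = 100)  # 100 observations, 20 variables

perm.cors <- replicate(2000, {
  scrambled <- apply(dat, 2, sample)        # scramble each column: destroys any real correlation
  r <- cor(scrambled)
  r[lower.tri(r)]                           # the unique off-diagonal entries
})

# 2.5 and 97.5 percentiles of each entry's null distribution: a mutual
# symmetric two-sided 95% interval under the null of no correlation.
band <- apply(perm.cors, 1, quantile, probs = c(0.025, 0.975))
range(band)                                 # note how large pure-noise correlations can get
```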

When reporting the results, no matter what computations you do, you should include the following:

  • The size of the correlation matrix (i.e., how many variables you have looked at).

  • How you determined the p-values or "significance" of any of the correlation coefficients (e.g., left them as-is, applied a Bonferroni correction, did a permutation test, or whatever).

  • Whether you looked at alternative measures of correlation, such as Spearman rank correlation. If you did, also indicate why you chose the method you are actually reporting on and using.

– whuber
  • This is a pretty thorough description of p-value adjustment methods, but what is left unsaid is the criterion for adjustment. Traditionally it has been the familywise error rate. But that is a strict criterion and is not useful when you are looking at thousands of comparisons. In that case the false discovery rate, first suggested by Benjamini, is now commonly used (see the sketch after these comments). – Michael R. Chernick May 05 '12 at 03:01
  • What if we just want to look at correlations of very well defined pairs of variables (e.g. $corr(x_1,y_1)$,...,$corr(x_n,y_n)$, where each $x_i$ and $y_i$ are variables) but don't care about all the other possible combinations (i.e., don't care about $corr(x_i,y_j)$ $\forall i \not= j$)? Do we still need a correction? – Jase Dec 16 '12 at 16:19
  • @Jase Yes, you do. The amount of correction depends on the interrelationships among the variables. Simulation-based methods are about the only practicable way to determine these corrections. – whuber Dec 16 '12 at 16:31
  • Wow nice. Will this method that you discussed also correct the standard errors for serial correlation and heteroscedasticity issues? – Jase Dec 16 '12 at 16:38
  • @Jase It would be difficult to interpret correlation coefficients in a heteroscedastic model. Your comment appears to refer to a linear model in a time series setting, rather than estimation of multivariate correlation coefficients. – whuber Dec 16 '12 at 17:23
  • How important is it to report the permutation test if the number of observations is very large (say N = 100,000)? I find that the correlation for the permuted observations is very close to 0, and thus it seems redundant to report this if that is always the case for large N. I am using permutation testing in the context of finding the highest Pearson corr. for pairwise comparisons (see http://stackoverflow.com/questions/33650188/efficient-pairwise-correlation-for-two-matrices-of-features/33651442#33651442). – pir Nov 12 '15 at 17:27
  • +1. The link for "standard methods to adjust the p-value" is broken. Could you please repair it? Thanks. – Hans Mar 28 '18 at 07:09
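
A brief illustration of the familywise-versus-FDR distinction raised in Michael Chernick's comment above, using R's built-in `p.adjust`; the p-values here are simulated pure noise:

```r
# Bonferroni (familywise) vs. Benjamini-Hochberg (FDR) adjustment on
# 4950 null p-values, as would arise from correlating 100 variables.
set.seed(2)
p <- runif(4950)                                 # null p-values are uniform
sum(p < 0.05)                                    # raw: roughly 247 false "discoveries"
sum(p.adjust(p, method = "bonferroni") < 0.05)   # familywise control: essentially none survive
sum(p.adjust(p, method = "BH") < 0.05)           # FDR control: none survive under the null
```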
Answer (score 10)

From your follow-up response to Peter Flom's question, it sounds like you might be better served by techniques that look at higher-level structure in your correlation matrix.

Techniques like factor analysis, PCA, multidimensional scaling, and cluster analysis of variables can be used to group your variables into sets of relatively more related variables.
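
A minimal sketch of two of these ideas in R, clustering variables on their correlations and running a quick PCA; the simulated data and all parameter choices (the distance, the number of clusters) are illustrative:

```r
# Sketch: grouping variables via their correlation matrix, plus PCA.
set.seed(3)
dat <- matrix(rnorm(100 * 30), nrow = 100)  # 100 observations, 30 variables
r <- cor(dat)

# Hierarchical clustering of variables, treating 1 - |r| as a distance:
hc <- hclust(as.dist(1 - abs(r)))
cutree(hc, k = 5)                           # assign each variable to one of 5 groups

# Principal components of the standardized variables:
pc <- prcomp(dat, scale. = TRUE)
summary(pc)                                 # proportion of variance per component
```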

Also, you may want to think theoretically about what kind of structure should be present. When your number of variables is large and the number of observations is small, you are often better off relying more on prior expectations.

– Jeromy Anglim
Answer (score 8)

Perhaps you could do a preliminary analysis on a random subset of the data to form hypotheses, and then test those few hypotheses of interest using the rest of the data. That way you would not have to correct for nearly as many multiple tests. (I think...)

Of course, if you use such a procedure you will be reducing the size of the dataset used for the final analysis and so reduce your power to find real effects. However, corrections for multiple comparisons reduce power as well and so I'm not sure that you would necessarily lose anything.
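
A rough sketch of this two-stage idea in R, with simulated data; the split ratio and the choice of 5 candidate pairs are illustrative:

```r
# Sketch: explore on one random half of the data, confirm on the other.
set.seed(4)
n <- 100
dat <- matrix(rnorm(n * 50), nrow = n)      # 50 variables, pure noise here
half <- sample(n, n / 2)

explore <- dat[half, ]
confirm <- dat[-half, ]

# Pick, say, the 5 largest |correlations| in the exploration half:
r <- cor(explore)
r[upper.tri(r, diag = TRUE)] <- 0           # keep each pair only once
idx <- order(abs(r), decreasing = TRUE)[1:5]
pairs <- arrayInd(idx, dim(r))              # row/column indices of the top pairs

# Re-test only those 5 pairs on the confirmation half (5 tests, not 1225):
apply(pairs, 1, function(ij)
  cor.test(confirm[, ij[1]], confirm[, ij[2]])$p.value)
```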

– Michael Lew
  • (+1) This is a great idea generally. For large correlation matrices, however, there are so many statistics and so many of them can simultaneously be spuriously large that it usually pays to adjust. Otherwise you wind up chasing a large number of misleadingly "significant" correlations that just disappear in the hold-out data. (Run a simulation with, say, a few hundred draws from 50 uncorrelated standard normal variates. It's an eye-opener.) – whuber Dec 31 '10 at 05:14
Answer (score 7)

This is an example of multiple comparisons. There's a large literature on this.

If you have, say, 100 variables, then you will have 100 × 99 / 2 = 4950 correlations.

If the data are just noise, then you would expect 1 in 20 of these to be significant at p = .05. That's 247.5 spuriously significant correlations, on average.
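
As a quick check of that arithmetic in R:

```r
choose(100, 2)          # 4950 distinct pairs of variables
choose(100, 2) * 0.05   # 247.5 expected false positives at p = .05
```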

Before going further, though, it would be good if you could say WHY you are doing this. What are these variables, why are you correlating them, and what is your substantive idea?

Or, are you just fishing for high correlations?

– Peter Flom
  • The reason I wanted to do it like this was to keep an open mind toward understanding my data, so maybe in a way I am fishing for correlations I had not thought of before, for the purpose of getting enlightened. I am certainly not doing this to satisfy my boss or something arbitrary. I would rather not get into the specifics of the data, as I want a general answer to this question so that I can use it in all situations in the future. – David Dec 25 '10 at 22:43