
I often see people and libraries (e.g. here, here or here) reporting p-values whenever they measure correlation coefficients between two random variables. This always makes me wonder: what exactly is their null?

More specifically, what distribution do they assume the correlation coefficients come from (if the null hypothesis is true)?

I am guessing they use:

$$f(r) = \frac{(1-r^{2})^{\frac{n-4}{2}}}{\mathrm{B}\left(\frac{1}{2}, \frac{n-2}{2}\right)}$$

as per this formula for the bivariate normal distribution where the true $\rho$ is assumed to be 0. Is this the case? Or is it something else?

Amelio Vazquez-Reina
  • Many of them base it off the t-test in simple regression – Glen_b Jun 30 '16 at 05:17
  • Thanks @Glen_b that would be with the sample variance estimated using e.g. a bootstrap sample? – Amelio Vazquez-Reina Jun 30 '16 at 20:20
  • Hang on I'll post a brief answer. The test indirectly uses the variance of the sample in the calculation of the correlation coefficient, but the test has nothing to do with bootstrap. The sampling distribution of the correlation coefficient arises as a consequence of the usual assumptions made for the inference in regression. – Glen_b Jul 01 '16 at 01:06

1 Answer


Wikipedia mentions a number of tests for correlation.

Many packages base a test with a null of zero correlation off the t-test in simple regression.

Note that both the F test for the regression and the t-test for the coefficient can be re-written as a test in terms of the correlation (and sample size) alone.
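
As a sketch of that rewriting, with $r$ the sample correlation and $n$ the number of pairs (notation taken from the question; this is the standard identity rather than anything specific to one package):

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}, \qquad F = t^{2} = \frac{(n-2)\,r^{2}}{1-r^{2}}$$

Under the usual regression assumptions and a null of zero correlation, $t$ follows a Student $t$ distribution with $n-2$ degrees of freedom and $F$ follows an $F_{1,\,n-2}$ distribution. For bivariate normal data with $\rho = 0$, this is equivalent to the density of $r$ quoted in the question. Plugging in $r = 0.8068949$ and $n = 50$ from the cars data gives $t \approx 9.464$ and $F \approx 89.57$.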

For example, if you use R, the same p-value, 1.49e-12, can be seen in the cor.test output and also twice in the regression (lm) output, once for the t-test on the slope and once for the overall F test:

> cor.test(~dist+speed,cars)

        Pearson's product-moment correlation

data:  dist and speed
t = 9.464, df = 48, p-value = 1.49e-12
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6816422 0.8862036
sample estimates:
      cor 
0.8068949 

> summary(lm(dist~speed,cars))

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12
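

The agreement in the output above can also be checked by hand. The following is a minimal sketch, assuming the built-in `cars` data and the identity t = r * sqrt(n - 2) / sqrt(1 - r^2); the variable names (`t_stat`, `f_stat`, `p_val`) are only illustrative.

```r
# Sample correlation and sample size for the built-in cars data
r <- cor(cars$dist, cars$speed)
n <- nrow(cars)

# t statistic written in terms of r and n alone
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# In simple regression the F statistic is the square of the slope's t statistic
f_stat <- t_stat^2

# These should agree with the t, F and p-value reported by cor.test and lm
print(c(t = t_stat, F = f_stat, p = p_val))
```

Only `r` and `n` enter the calculation, which is why the correlation test and the regression tests report the same p-value.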
Glen_b