
There seems to be a common use of the uncorrected sample standard deviation in calculating the correlation coefficient:

I'm wondering whether I am missing something about the apparent preference for using the corrected sample standard deviation:

I've posted this to:

StatSmartWannaB

1 Answer


It makes no difference whatsoever, provided you compute consistently.

Suppose you define your version of the variance of a batch of data $X = x_1, x_2, \ldots, x_n$, with mean $\bar x = (x_1 + x_2 + \cdots + x_n)/n$, to be

$$\text{var}_f(X) = f(n)\left((x_1-\bar x)^2 + (x_2 - \bar x)^2 + \cdots + (x_n - \bar x)^2\right)$$

where $f:\mathbb{N}\to (0,\infty)$ is a personalized function giving some positive value $f(n)$ for each positive integer $n$. Some people might use $f(n) = 1/n$, others might use $f(n) = 1/(n-1)$, and others might use something else altogether such as $f(n)=\Gamma((n-1)/2)^2/(2\Gamma(n/2)^2)$. It doesn't matter, provided you use the same function $f$ to compute the covariance of any batch of paired data $(X,Y) = (x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$:

$$\text{cov}_f(X,Y) = f(n)\left((x_1-\bar x)(y_1-\bar y) + (x_2-\bar x)(y_2-\bar y) + \cdots + (x_n-\bar x)(y_n-\bar y)\right).$$

Consequently your personal correlation coefficient will be

$$\eqalign{ \rho_f(X,Y) &= \frac{\text{cov}_f(X,Y)}{\sqrt{\text{var}_f(X)}\sqrt{\text{var}_f(Y)}} \\ &= \frac{f(n)\sum_i(x_i-\bar x)(y_i-\bar y)}{\sqrt{f(n)\sum_i(x_i-\bar x)^2}\sqrt{f(n)\sum_i(y_i-\bar y)^2}}\\ &= \frac{\sum_i(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_i(x_i-\bar x)^2}\sqrt{\sum_i(y_i-\bar y)^2}}. }$$

Since the factor $f(n)/(\sqrt{f(n)}\sqrt{f(n)}) = 1$ disappears, your correlation coefficient will be the same as anyone else's. As a bonus, now you know you don't have to compute $f$.
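If you would like to check the cancellation numerically, here is a minimal sketch using only the Python standard library. The data values and the helper names (`centered_sum`, `cov_f`, `var_f`, `corr_f`, `normalizers`) are made up for illustration; they are not part of any library. It computes the variance and the correlation under the three choices of $f$ mentioned above.

```python
import math

def centered_sum(xs, ys):
    """Sum of products of deviations from the respective means."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys))

def cov_f(xs, ys, f):
    """Covariance with an arbitrary positive normalizer f(n)."""
    return f(len(xs)) * centered_sum(xs, ys)

def var_f(xs, f):
    """Variance is the covariance of a batch with itself."""
    return cov_f(xs, xs, f)

def corr_f(xs, ys, f):
    """Correlation built consistently from cov_f and var_f."""
    return cov_f(xs, ys, f) / (math.sqrt(var_f(xs, f)) * math.sqrt(var_f(ys, f)))

# Hypothetical paired data, chosen only for illustration.
x = [1.0, 2.0, 4.0, 7.0, 11.0]
y = [2.0, 1.0, 5.0, 6.0, 12.0]

# The three normalizers mentioned in the text.
normalizers = {
    "1/n": lambda n: 1 / n,
    "1/(n-1)": lambda n: 1 / (n - 1),
    "gamma-based": lambda n: math.gamma((n - 1) / 2) ** 2 / (2 * math.gamma(n / 2) ** 2),
}

results = {name: corr_f(x, y, f) for name, f in normalizers.items()}
for name, f in normalizers.items():
    print(f"{name:12s} var_f(x) = {var_f(x, f):10.4f}   rho_f = {results[name]:.12f}")
```

The variances printed in the middle column differ from one normalizer to the next, while the correlations agree up to floating-point rounding. The agreement depends on using the same $f$ in the numerator and the denominator; mixing, say, a $1/n$ covariance with $1/(n-1)$ standard deviations would leave a stray factor of $(n-1)/n$.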

whuber