Page 358 of Introduction to probability, second edition, by Blitzstein and Hwang, defines the sample covariance as

$$r = \dfrac{1}{n} \sum_{i = 1}^n (x_i - \bar{x})(y_i - \bar{y}),$$

where $\bar{x} = \dfrac{1}{n} \sum\limits_{i = 1}^n x_i$ and $\bar{y} = \dfrac{1}{n} \sum\limits_{i = 1}^n y_i$ are the sample means.

However, I have also seen the sample covariance (matrix) defined as

$$S = \dfrac{1}{n - 1} \sum\limits_{i = 1}^n (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})^T.$$

Why is $n$ used in the denominator of the first definition, but $n - 1$ is used in the denominator of the second definition?

I would greatly appreciate it if people would please take the time to explain this.

EDIT:

My understanding is that this is a matter of convention, but why does this difference exist at all? Obviously, the two formulas do not produce the same value, so the choice of denominator clearly matters.
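To make the difference concrete, here is a small NumPy check (the data values are arbitrary, chosen just for illustration). Note that `np.cov` uses the $1/(n-1)$ denominator by default and the $1/n$ denominator when `bias=True`:

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([2.0, 1.0, 5.0, 8.0])
n = len(x)

# Sum of products of deviations from the sample means
s = np.sum((x - x.mean()) * (y - y.mean()))

cov_n = s / n          # 1/n version (as in Blitzstein and Hwang)
cov_n1 = s / (n - 1)   # 1/(n-1) version

print(cov_n)                           # 6.0
print(cov_n1)                          # 8.0
print(np.cov(x, y, bias=True)[0, 1])   # matches the 1/n version
print(np.cov(x, y)[0, 1])              # matches the 1/(n-1) version
```

As the output shows, the two conventions differ by a factor of $\frac{n}{n-1}$, which vanishes as $n \to \infty$ but is noticeable for small samples.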

The Pointer