Wikipedia defines the covariance matrix as a generalized version of the variance: $C=E[(X-E[X])(X-E[X])^T]$. But I usually see statisticians calculate the covariance matrix as $C=\frac{X X^T}{n-1}$ (e.g., Amoeba's nice answer here: https://stats.stackexchange.com/a/134283/83526). Is $E[(X-E[X])(X-E[X])^T] = \frac{X X^T}{n-1}$? Are these both correct definitions of the covariance matrix?
- Amoeba explicitly prefaces that formula with the statement "Let us assume that it [$X$] is centered, i.e. column means have been subtracted and are now equal to zero." That shows his formula is identical to the one in Wikipedia. – whuber Jun 14 '17 at 20:38
2 Answers
Let $\mu = E(X)$. Then $$Var(X) = E\left((X - \mu)(X - \mu)^T\right) = E\left(XX^T - \mu X^T - X \mu^T + \mu \mu^T\right) \\ = E(XX^T) - \mu\mu^T,$$ since $E(X\mu^T) = E(\mu X^T) = \mu\mu^T$. This generalizes the well-known scalar identity $Var(Z) = E(Z^2) - E(Z)^2$.
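As a quick numerical sanity check of this identity (an illustrative sketch; the correlated pair and the sample size below are arbitrary choices):

set.seed(7)
n  <- 1e6
z1 <- rnorm(n)
z2 <- 0.5 * z1 + rnorm(n)
X  <- rbind(z1 + 3, z2 - 1)        # 2 x n draws with a nonzero mean
mu <- rowMeans(X)                  # Monte Carlo estimate of E(X)
(X %*% t(X)) / n - mu %*% t(mu)    # approximates E(XX^T) - mu mu^T
cov(t(X))                          # agrees up to Monte Carlo error (and the negligible n vs. n-1 factor)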
Turning to data: if $X$ is now the $p\times n$ data matrix whose columns are observations and $\hat \mu$ is the vector of row (variable) means, the natural unbiased estimator of $\Sigma := Var(X)$ is $\hat \Sigma = \frac 1{n-1}XX^T - \frac n{n-1}\hat \mu \hat \mu^T$.
In many situations we can take $\mu = 0$ without any loss of generality. One common example is PCA. If we center the data so that each variable has sample mean zero, then $\hat \mu = 0$ and our estimate of the variance is simply $\frac 1{n-1}XX^T$. The univariate analogue of this is the familiar $s^2 = \frac 1{n-1} \sum_i x_i^2$ when $\bar x = 0$.
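For instance (a minimal sketch with an arbitrary made-up matrix; as elsewhere in this thread, rows are variables and columns are observations):

set.seed(1)
Z  <- matrix(rnorm(20), nrow = 4)   # 4 variables observed 5 times
Zc <- Z - rowMeans(Z)               # center each variable so that mu-hat = 0
all.equal(Zc %*% t(Zc) / (ncol(Z) - 1), cov(t(Z)))   # TRUE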
As @Christoph Hanck points out in the comments, you need to distinguish between estimates and parameters here. There is only one definition of $\Sigma$, namely $E((X - \mu)(X - \mu)^T)$. So $\frac 1{n-1}XX^T$ is absolutely not the correct definition of the population covariance, but once the data have been centered it is the usual unbiased estimate of it, i.e. $\Sigma = E(\frac 1{n-1}XX^T)$ when each variable's sample mean has been subtracted. (If instead $\mu = 0$ is known a priori and no centering is done, the unbiased choice is $\frac 1n XX^T$.)
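A small simulation sketch of that unbiasedness claim (illustrative only; the true $\Sigma$, the dimension, and the sample size are arbitrary): averaging the centered estimate over many replications recovers $\Sigma$.

set.seed(123)
Sigma <- matrix(c(2, 0.6, 0.6, 1), 2, 2)   # true covariance; true mean is 0
n <- 10
est <- replicate(20000, {
  X  <- t(chol(Sigma)) %*% matrix(rnorm(2 * n), 2, n)  # columns are iid N(0, Sigma) draws
  Xc <- X - rowMeans(X)                                # center each variable
  Xc %*% t(Xc) / (n - 1)
})
rowMeans(est, dims = 2)   # elementwise average of the 20000 estimates; close to Sigma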

- To understand the linear algebra used in the answer better, you can look at https://math.stackexchange.com/questions/198257/intuition-for-the-product-of-vector-and-matrices-xtax/198280#198280 – kjetil b halvorsen Jun 14 '17 at 16:26
- +1 I tried a few times to squeeze a comment in here, but the $LaTeX$ didn't show well, so I entered an answer as an extended comment instead. – Antoni Parellada Jun 15 '17 at 00:14
- Note that there is a bit of ambiguity in the symbology here (inherited from the question): The symbol $X$ is used to refer to *both* the random variable ($p\times{1}$, vector valued), and the (transposed) [data matrix](https://en.wikipedia.org/wiki/Data_matrix_(multivariate_statistics)) ($p\times{n}$, columns are samples of the random variable). – GeoMatt22 Jun 15 '17 at 00:30
COMMENT:
@Chacone's answer is great as is, but I think there is one step that is left unexplained, and it is clearer with expectation notation. To reflect @GeoMatt22's comment, in the following proof $X$ is a $p\times 1$ random vector:
$$ \begin{align} \text{Cov}(X)&=\mathbb E\left[\left(X- \mathbb E[X] \right) \left(X- \mathbb E[X] \right)^\top \right]\\[2ex] &= \mathbb E\left[XX^\top - X\,\mathbb E[X]^\top - \mathbb E[X] \,X^\top + \mathbb E[X]\, \mathbb E[X]^\top \right]\\[2ex] &= \mathbb E\left[XX^\top\right] - \mathbb E[X]\,\mathbb E[X]^\top - \mathbb E[X] \,\mathbb E[X]^\top + \mathbb E[X]\, \mathbb E[X]^\top \\[2ex] &= \mathbb E[XX^\top] \; -\; \mathbb E[X]\, \mathbb E[X]^\top \end{align} $$
As for the transition to the sample estimate of the population covariance using this alternate (raw moment) formula, the denominators need adjusting. In what follows $X$ is a $p \times n$ data matrix (following @GeoMatt22's comment):
$$\begin{align} \widehat{\text{Cov}}(X)&=\frac{XX^\top}{n-1} \; -\; \frac{n}{n-1}\; \begin{bmatrix}\bar X_1\\ \bar X_2\\ \vdots \\ \bar X_p \end{bmatrix} \begin{bmatrix}\bar X_1 & \bar X_2 & \cdots & \bar X_p \end{bmatrix} \end{align} $$
because it is necessary to multiply by $\frac{n}{n-1}$ to replace the biased $n$ denominator with $n-1$. Evidently, everything after the minus sign disappears if the mean is zero.
Here is an illustrative example in R:
> set.seed(0)
> # Sampling three random variables, X1, X2 and X3:
> X1 = 1:5
> X2 = rnorm(5, 5, 1)
> X3 = runif(5)
> # Forming a matrix in which each row holds the samples of one variable:
> (X = rbind(X1, X2, X3))
[,1] [,2] [,3] [,4] [,5]
X1 1.00000000 2.0000000 3.0000000 4.0000000 5.0000000
X2 6.26295428 4.6737666 6.3297993 6.2724293 5.4146414
X3 0.06178627 0.2059746 0.1765568 0.6870228 0.3841037
> # Estimating the expectation of each variable (the row means, bar X_i):
> mu = rowMeans(X)
> # Calculating manually the variance with the alternate formula
> (man.cov = ((X %*% t(X)) / (ncol(X) - 1)) - (ncol(X)/(ncol(X) - 1)) * (mu %*% t(mu)))
X1 X2 X3
X1 2.50000000 -0.02449075 0.28142079
X2 -0.02449075 0.53366886 0.02019664
X3 0.28142079 0.02019664 0.05940930
> # Comparing to the built-in formula:
> cov(t(X))
X1 X2 X3
X1 2.50000000 -0.02449075 0.28142079
X2 -0.02449075 0.53366886 0.02019664
X3 0.28142079 0.02019664 0.05940930
> # are the same...
> all.equal(man.cov, cov(t(X)))
[1] TRUE
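Continuing the same session to make the zero-mean remark concrete (an extra check: centering each row makes the second term vanish):

> Xc = X - rowMeans(X)
> all.equal((Xc %*% t(Xc)) / (ncol(X) - 1), cov(t(X)))
[1] TRUE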
