
For an observation matrix $X$, I use PCA to reduce the dimension of the data to $L$. I know that in this case PCA is guaranteed to minimise the mean reconstruction error. In Wikipedia notation, this is something like $$\|X-T_LW_L^T\|^2_2,$$ where $W$ are the loadings, $T$ are the scores, and the subscript $L$ indicates that only $L$ components are used.

I can also ask how much of the total variance is explained by these $L$ components.

My question is: is there a mathematical relation between the two? Assuming I have only the squared error and the covariance matrix of $X$, can I compute the explained variance?


2 Answers


If your $\newcommand{\X}{\mathbf X}\X$ is an $n\times d$ matrix with column means subtracted and $\newcommand{\W}{\mathbf W}\W_L$ is a $d \times L$ matrix consisting of $L$ principal directions (eigenvectors of the covariance matrix), then the reconstruction error is given by $$\mathrm{Error}^2 = \|\X - \X\W_L^\vphantom{\top}\W_L^\top\|^2.$$ Note that this is not the mean squared error; it is the sum of squared errors. Here $\X\W_L$ is what you called $\mathbf T_L$ in your question, and $\|\cdot\|$ denotes the Frobenius norm.
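As a quick numerical illustration, here is a minimal numpy sketch (toy data; the sizes and variable names such as `X`, `W_L` are my own choices, not anything from the question) that builds $\W_L$ from the eigenvectors of the covariance matrix and evaluates this squared Frobenius norm:

```python
import numpy as np

# Toy data for illustration only.
rng = np.random.default_rng(0)
n, d, L = 200, 6, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)                    # subtract column means

C = X.T @ X / (n - 1)                     # sample covariance matrix
evals, evecs = np.linalg.eigh(C)          # eigenvalues in ascending order
W_L = evecs[:, ::-1][:, :L]               # d x L matrix of top-L principal directions

error_sq = np.linalg.norm(X - X @ W_L @ W_L.T, 'fro') ** 2   # sum of squared errors
print(error_sq)
```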

The proportion of explained variance in PCA can be defined as $$\text{Proportion of explained variance} = \frac{\|\X\W_L\|^2}{\|\X\|^2},$$ i.e. it is a ratio of scores' sum of squares to the overall sum of squares (or equivalently, a ratio of scores' total variance to the overall total variance).
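A corresponding sketch for this ratio (same toy setup as above; the comparison against the usual eigenvalue ratio is only a sanity check, since the scores' sum of squares equals $(n-1)$ times the sum of the top-$L$ eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 200, 6, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X / (n - 1))
W_L = evecs[:, ::-1][:, :L]               # top-L principal directions

prop = np.linalg.norm(X @ W_L, 'fro') ** 2 / np.linalg.norm(X, 'fro') ** 2
prop_eig = evals[::-1][:L].sum() / evals.sum()   # usual eigenvalue ratio
print(np.isclose(prop, prop_eig))                # True
```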

To see the connection between them, we need the following identity: $$\|\X\|^2 = \|\X - \X\W_L^\vphantom{\top}\W_L^\top\|^2 + \|\X\W_L^\vphantom{\top}\W_L^\top\|^2 = \|\X - \X\W_L^\vphantom{\top}\W_L^\top\|^2 + \|\X\W_L\|^2.$$ It might look mysterious, but it is nothing other than the Pythagorean theorem: the reconstruction $\X\W_L^\vphantom{\top}\W_L^\top$ is an orthogonal projection of $\X$, so the residual is orthogonal to it. See an informal explanation in my answer to Making sense of principal component analysis, eigenvectors & eigenvalues and a formal explanation in my answer to PCA objective function: what is the connection between maximizing variance and minimizing error?
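If you prefer to convince yourself numerically, here is a short sketch (toy data and variable names are again my own) that checks the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 200, 6, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

evals, evecs = np.linalg.eigh(X.T @ X / (n - 1))
W_L = evecs[:, ::-1][:, :L]
X_hat = X @ W_L @ W_L.T                   # rank-L reconstruction

lhs = np.linalg.norm(X, 'fro') ** 2
rhs = np.linalg.norm(X - X_hat, 'fro') ** 2 + np.linalg.norm(X @ W_L, 'fro') ** 2
print(np.isclose(lhs, rhs))               # True
```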

We now see that $$\text{Proportion of explained variance} = 1-\frac{ \text{Error}^2}{\|\X\|^2}.$$

If you know the covariance matrix $\mathbf C = \frac{1}{n-1}\X^\top \X$, then you can compute the total variance $\operatorname{tr}\mathbf C$. The squared norm $\|\X\|^2$ is then given by the total variance multiplied by $n-1$. So we finally obtain

$$\text{Proportion of explained variance} = 1-\frac{\text{Error}^2}{(n-1)\operatorname{tr}\mathbf C}.$$
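Putting it all together, a minimal sketch (toy data; the names `prop_from_error`, `prop_direct` are just for illustration) that recovers the explained variance from nothing but the squared error and the covariance matrix, and checks it against the direct definition:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, L = 200, 6, 2
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))
X = X - X.mean(axis=0)

C = X.T @ X / (n - 1)
evals, evecs = np.linalg.eigh(C)
W_L = evecs[:, ::-1][:, :L]
error_sq = np.linalg.norm(X - X @ W_L @ W_L.T, 'fro') ** 2

# Explained variance from the error and the covariance matrix alone...
prop_from_error = 1 - error_sq / ((n - 1) * np.trace(C))
# ...versus the direct definition, for comparison.
prop_direct = np.linalg.norm(X @ W_L, 'fro') ** 2 / np.linalg.norm(X, 'fro') ** 2
print(np.isclose(prop_from_error, prop_direct))   # True
```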

amoeba

Let $ W= \left[ \begin{matrix}W_1 & W_2\end{matrix} \right]$ and $T= \left[ \begin{matrix}T_1 & T_2\end{matrix} \right]$ be partitions of the loadings matrix $W\in\mathbb{R}^{M\times M}$ and the scores matrix $T\in\mathbb{R}^{N\times M}$, where $W_1$ and $T_1$ are the submatrices containing columns $1\dots L$ of $W$ and $T$ respectively. The data matrix can be written in terms of these submatrices as $$ X = TW^\top = T_1 W_1^\top + T_2 W_2^\top, $$ and since $W$ is orthogonal, $T = XW$, which implies $T_1=XW_1$ and $T_2=XW_2$. On the other hand, the reconstructed data matrix is defined to be $\hat X=T_1W_1^\top$. Combining these equations, the error matrix $E=X-\hat X$ is $$ E = X - \hat X = X - T_1 W_1^\top = T_2 W_2^\top = XW_2 W_2^\top. $$ The rows of $E$ are the error vectors, that is, the differences between the original data vectors (the rows of $X$) and the reconstructed vectors (the rows of $\hat X$). Therefore the trace of $EE^\top$, which is the sum of the squared lengths of the error vectors, is the total squared reconstruction error.
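A small numpy sketch (toy data; `W1`, `W2` and the sizes are my own illustrative choices) confirming that the error matrix is indeed $XW_2W_2^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 200, 6, 2
X = rng.standard_normal((N, M)) @ rng.standard_normal((M, M))
X = X - X.mean(axis=0)

evals, W = np.linalg.eigh(X.T @ X / (N - 1))
W = W[:, ::-1]                            # columns ordered by decreasing eigenvalue
W1, W2 = W[:, :L], W[:, L:]               # kept / discarded principal directions

E = X - X @ W1 @ W1.T                     # error matrix X - X_hat
print(np.allclose(E, X @ W2 @ W2.T))      # True
```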

So what is the trace of $EE^\top$?

We know that $$ E^\top E = W_2 W_2^\top X^\top XW_2 W_2^\top, $$ and substituting $(N-1)S=X^\top X$, where $S$ is the covariance matrix of $X$, into the above equation gives $$ \begin{align} E^\top E &= (N-1)W_2 W_2^\top SW_2 W_2^\top \\ &= (N-1)W_2 W_2^\top W_2 D_2 W_2^\top \\ &= (N-1)W_2 D_2 W_2^\top, \end{align} $$ where we used the eigenvector relation $SW_2 = W_2 D_2$ and the orthonormality $W_2^\top W_2 = I$, and $D_2$ is the diagonal matrix with the lowest $(M-L)$ eigenvalues of $S$. Finally, using the cyclic property of the trace we get $$ \begin{align} \text{trace}(EE^\top) &= \text{trace}(E^\top E) \\ &= \text{trace}((N-1)W_2 D_2 W_2^\top) \\ &= \text{trace}((N-1)W_2^\top W_2 D_2) \\ &= \text{trace}((N-1) D_2). \end{align} $$ This tells us that the total squared reconstruction error is just the sum of the lowest $(M-L)$ eigenvalues of $S$ multiplied by the constant $(N-1)$. Therefore minimising the total squared reconstruction error is equivalent to maximising the sum of the $L$ highest eigenvalues of $S$, which in turn is equivalent to maximising the variance of the first $L$ principal component scores.
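The trace identity can be checked numerically with the same kind of toy setup as before (again, the data and names are only illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, L = 200, 6, 2
X = rng.standard_normal((N, M)) @ rng.standard_normal((M, M))
X = X - X.mean(axis=0)

S = X.T @ X / (N - 1)                     # covariance matrix of X
evals, W = np.linalg.eigh(S)              # ascending eigenvalues
W, evals = W[:, ::-1], evals[::-1]        # reorder to descending
W1 = W[:, :L]

E = X - X @ W1 @ W1.T
lhs = np.trace(E @ E.T)                   # total squared reconstruction error
rhs = (N - 1) * evals[L:].sum()           # (N-1) times the lowest M-L eigenvalues
print(np.isclose(lhs, rhs))               # True
```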

Ernest A