Say that the raw data is $N$-dimensional, where $N$ is a large positive integer. If we apply PCA to the dataset, and compute the reconstruction error (in $\ell^2$-norm) using the first $d \leq N$ components, will the reconstruction error monotonically decrease to $0$ as $d \rightarrow N$?
2 Answers
First, review the connections between PCA and SVD. See: Relationship between SVD and PCA. How to use SVD to perform PCA? We consider a matrix $A\in\mathbb{R}^{m\times n}$ with $m \ge n$ that has been appropriately centered and scaled, so that the SVD of $A$ relates directly to PCA as described in the other thread.
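As a minimal sketch of that setup, here is how PCA via SVD can be computed in NumPy; the toy data and variable names below are my own, not from the linked thread:

```python
import numpy as np

# Hypothetical toy data: m = 100 observations (rows), n = 5 features (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

# Center the columns so that the SVD of A corresponds to PCA of X.
A = X - X.mean(axis=0)

# Thin SVD: A = U S V^T, with singular values sorted in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

scores = U * s                           # principal component scores (projections of the data)
directions = Vt.T                        # principal directions (eigenvectors of the covariance matrix)
eigenvalues = s**2 / (A.shape[0] - 1)    # eigenvalues of the sample covariance matrix
```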
Second, we need to review the Eckart–Young–Mirsky theorem. This theorem states that, in the spectral norm $\|\cdot\|_2$, the best rank-$d$ approximation to a matrix $A$ (the result holds for any rectangular, real-valued matrix) is given by $$ A_d = \sum_{i=1}^d \sigma_i u_i v_i^\top, $$ where $u_i, v_i$ are the $i$th columns of $U$ and $V$ in the singular value decomposition $A = USV^\top$. The singular values in $S$ (and the corresponding columns of $U$ and $V$) can be arranged so that $\sigma_1 \ge \sigma_2 \ge \sigma_3 \ge \cdots \ge \sigma_n \ge 0$.
In other words, the Eckart–Young–Mirsky theorem shows that
$A_d$ minimizes $$ \| A - B \|_2 $$ over all matrices $B$ of rank at most $d$.
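As a hedged illustration of the theorem's construction (a direct transcription of the formula for $A_d$ above, assuming a NumPy-style thin SVD, not production code):

```python
import numpy as np

def rank_d_approximation(A, d):
    """Best rank-d approximation of A in the spectral norm (Eckart-Young-Mirsky)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep the d largest singular values and their singular vectors; drop the rest.
    return U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]
```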
So all that remains is to show that $\| A - A_d \|_2$ is non-increasing in $d$.
This is straightforward: choosing $d$ fixes how many singular values are retained and how many are zeroed out, and we always retain the $d$ largest $\sigma_i$. Increasing $d$ therefore only removes terms from the residual, so the reconstruction error must be non-increasing. Naturally, when $d = n$ the reconstruction error is zero because all of the data is retained.
This intuition can be demonstrated directly from the definition of the spectral norm, which is the largest singular value of the matrix: $$\| A \|_2 = \max \{\sigma_1, \sigma_2, \dots, \sigma_n \}.$$ The largest singular value of the residual $A - A_d$ does not increase as $d$ increases, because the singular values are arranged in non-increasing order.
Or we can just crank through the algebra to show the same result.
$$\begin{align} \| A - A_d \|_2 &= \| USV^\top - US_d V^\top \|_2 \\ &= \left\| \sum_{i=1}^n \sigma_iu_iv_i^\top - \sum_{i=1}^d \sigma_iu_iv_i^\top \right\|_2 \\ &= \left\| \sum_{i=d+1}^n \sigma_iu_iv_i^\top \right\|_2 \\ &= \max \{ \sigma_{d+1}, \sigma_{d+2}, \dots, \sigma_{n}, 0 \} \\ &= \sigma_{d+1} \le \sigma_d. \end{align}$$
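The same identity is easy to check numerically. This sketch (a toy example of my own, with no special structure assumed) prints the spectral-norm error alongside $\sigma_{d+1}$ for each $d$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 5))                       # a generic (almost surely full-rank) matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for d in range(1, A.shape[1] + 1):
    A_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]   # rank-d reconstruction
    err = np.linalg.norm(A - A_d, ord=2)          # spectral norm of the residual
    sigma_next = s[d] if d < len(s) else 0.0      # sigma_{d+1}, or 0 when d = n
    print(d, err, sigma_next)                     # err matches sigma_{d+1}; it never increases
```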
Conclusion
It is not true that increasing $d$ will always decrease the reconstruction error; in some cases the reconstruction error will remain the same. This happens when the additional $\sigma_i$ are zero, which is tied to the connection between the SVD and the rank of a matrix: the rank $r$ of $A$ is the number of positive singular values, so if $d > r$, the reconstruction error $\|A - A_d\|_2$ is the same as $\|A - A_r\|_2$ (namely zero).
However, we have proved that the reconstruction error does not increase as $d$ increases.
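To see the plateau described in the conclusion, here is a small hypothetical example built from a rank-3 matrix; the error decreases until $d$ reaches the rank and then stays at (numerical) zero:

```python
import numpy as np

rng = np.random.default_rng(2)
# A 10 x 6 matrix of rank 3: sigma_4 = sigma_5 = sigma_6 = 0 (up to rounding).
B = rng.normal(size=(10, 3)) @ rng.normal(size=(3, 6))
U, s, Vt = np.linalg.svd(B, full_matrices=False)

for d in range(1, B.shape[1] + 1):
    B_d = U[:, :d] @ np.diag(s[:d]) @ Vt[:d, :]
    print(d, np.linalg.norm(B - B_d, ord=2))  # decreases until d = 3, then stays at (numerical) zero
```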

- Can the error fail to decrease prior to reaching the rank of the data? – Acccumulation Sep 11 '20 at 02:08
- @Acccumulation That's a fair point; I've edited to clarify. The number of positive singular values is the rank $r$ of the matrix, so if $d > r$, there is no improvement compared to $d=r$. – Sycorax Sep 11 '20 at 13:28
Yes. The principal vectors are orthogonal, so all $N$ of them together span the $N$-dimensional space of raw signals, i.e. the reconstruction error is zero at $d = N$.
The reconstruction error comes from the projection onto the orthogonal complement subspace (spanned by the unused principal vectors). With each increase in $d$, this orthogonal complement subspace becomes smaller.
This is true for any raw signal. If you are talking about the reconstruction error of the entire dataset, then not only does it decrease monotonically, it decreases as fast as possible, because the principal vectors, which are the eigenvectors of the covariance matrix, are sorted in order of decreasing eigenvalue, as illustrated in the sketch below.
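Here is a minimal sketch of that projection argument, using toy data of my own and the covariance-eigendecomposition route rather than the SVD used in the other answer:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)                      # centered data

# Eigenvectors of the sample covariance matrix, sorted by decreasing eigenvalue.
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
V = eigvecs[:, np.argsort(eigvals)[::-1]]

for d in range(1, X.shape[1] + 1):
    P = V[:, :d] @ V[:, :d].T                # projector onto the span of the first d principal vectors
    residual = Xc - Xc @ P                   # component in the orthogonal complement
    print(d, np.sum(residual**2))            # total squared reconstruction error; non-increasing, 0 at d = N
```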

- Thank you, your answer is intuitively very clear to me! Is it possible to share some references with more technical proofs of these results? – weikeduo Sep 10 '20 at 17:48
- If the rank of the data is less than $N$, then once we reach its rank, increasing $d$ further will not decrease the error, making the decrease in error possibly non-strict. – Acccumulation Sep 11 '20 at 02:08