Given a covariance matrix $\mathbf{\Sigma}$, the first principal component $u_1$ is the unit vector that maximizes the variance $u_1' \mathbf{\Sigma} u_1$. Is there an analogous quantity that the first $k$ principal components optimize when taken together? In other words, what do we maximize or minimize when we extract these principal components greedily?
One thought is that the first $k$ principal components span the $k$-dimensional subspace that maximizes the sum of squared norms of the projections of the data vectors onto it. For $k = 1$ this reduces to the variance maximization above: the squared norm of the projection of a centered data point $x$ onto $u_1$ is $(u_1' x)^2$, and its average over the data is exactly $u_1' \mathbf{\Sigma} u_1$. However, I'm not sure whether this intuition, or something else, holds in general.
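To make the intuition concrete, here is a minimal numerical sketch (my own, not from any reference): among orthonormal bases $U \in \mathbb{R}^{d \times k}$, it compares the captured variance $\mathrm{tr}(U' \mathbf{\Sigma} U)$ for the top-$k$ eigenvector basis against many random subspaces. The helper names `projected_variance` and `random_orthonormal` are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random symmetric positive semi-definite "covariance" matrix.
d, k = 6, 2
A = rng.standard_normal((d, d))
Sigma = A @ A.T

# Top-k eigenvectors of Sigma (eigh returns eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(Sigma)
U_pca = eigvecs[:, -k:]  # d x k orthonormal basis of the PC subspace

def projected_variance(U):
    """Variance captured by the subspace spanned by the columns of U,
    i.e. trace(U' Sigma U); for centered data this equals the average
    squared norm of the projections onto that subspace."""
    return np.trace(U.T @ Sigma @ U)

def random_orthonormal(d, k):
    """Random d x k matrix with orthonormal columns (QR of a Gaussian)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

best_random = max(projected_variance(random_orthonormal(d, k))
                  for _ in range(10_000))

print("PC subspace:   ", projected_variance(U_pca))  # = sum of top-k eigenvalues
print("sum top-k eig: ", eigvals[-k:].sum())
print("best random:   ", best_random)                # never exceeds the PC value
```

In every run the random subspaces fall short of the top-$k$ eigenvector subspace, which is consistent with the conjecture above, though of course a numerical check is not a proof.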