I have been reading about Principal Components Analysis, and my understanding is that it tries to extract as much "variance" as possible from the predictors $ \vec{X} = (X_1, X_2, \ldots, X_n)$ by selecting an optimal loading vector $\vec{\phi} = (\phi_1, \ldots, \phi_n)$ such that
$$Z_1 = \vec{X}^T \vec{\phi} = \phi_1 X_1 + \cdots + \phi_n X_n $$
has maximal variance. We want maximal variance because the variance in the predictors can potentially explain the variance in some response $Y$ that might be analysed later.
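Here is a minimal numpy sketch of how I currently picture this (the data are made up purely for illustration, and I am assuming the optimal $\vec{\phi}$ is the leading eigenvector of the sample covariance matrix):

```python
import numpy as np

# Made-up data: three correlated predictors with different variances.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(
    mean=np.zeros(3),
    cov=[[4.0, 1.5, 0.5],
         [1.5, 2.0, 0.3],
         [0.5, 0.3, 1.0]],
    size=500,
)

# My understanding: the first loading vector is the unit vector that
# maximises Var(Z_1), i.e. the leading eigenvector of the sample
# covariance matrix.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
phi = eigvecs[:, -1]                   # ||phi|| = 1 by construction

Z1 = X @ phi                           # Z_1 = phi_1 X_1 + ... + phi_n X_n
print(np.var(Z1, ddof=1))              # matches the largest eigenvalue
print(eigvals[-1])
```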
However, I have heard that you must standardize the predictors (for example, to mean 0 and variance 1) if they are not measured in the same units, and also restrict the loading vector so that $\|\vec{\phi}\|=1$. As I understand it, this is so that no predictor dominates simply because of its units, and so that the variance of $Z_1$ cannot be made arbitrarily large just by rescaling $\vec{\phi}$.
But after I standardize, all predictors have variance 1, so how will principal components analysis identify the most "explanatory" predictors (those with high variance) if they are all the same now?
(How will we choose a loading vector and weight the predictors if all of them have the same variance, i.e. what would make one variable more favourable than another? A small example of what I mean follows.)
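To make the confusion concrete, here is a small made-up example: after standardising, every column has variance 1, yet the procedure still returns a very particular set of loadings, and I don't see what is driving that choice.

```python
import numpy as np

# Made-up data again: x1 and x2 are strongly related, x3 is unrelated.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Standardise each predictor: mean 0, variance 1.
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(Xs.var(axis=0, ddof=1))          # all three are (essentially) 1

# Yet the leading eigenvector of the (now correlation) matrix is far
# from weighting the three predictors equally.
S = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
print(eigvecs[:, -1])
```

The weights that come out are clearly not all equal, so something other than the individual variances must be deciding them, and that is exactly the part I don't follow.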
Thanks in advance