PCA is done through a series of orthogonal rotations. My impression of the PCA procedure is: the first component lies along the direction of largest variance, and the second component lies along a direction orthogonal to the first. The third component is then found through another round of orthogonal rotation. What makes the direction of second-largest variance have to be orthogonal to the first component?
See for example Point 3 [here](https://stats.stackexchange.com/a/110546/3277). In a 3D space (say), if you remove any one direction you are left with a 2D plane orthogonal to that removed dimension. Any line in this plane will be orthogonal to it. You may rotate the plane _about_ that removed axis as you like and any direction in the plane will still be orthogonal to that one. So, when you are finding the 2nd PC, you are not rotating in the initial space, you are rotating in the reduced subspace which is already orthogonal to the first PC. – ttnphns Dec 27 '19 at 13:23
Without some such orthogonality restriction, what would "second largest variance" mean? – whuber Dec 27 '19 at 14:13
3 Answers
Not the second largest variance but the second largest unique variance. The variance of a correlated second component would be larger than the variance of an orthogonal second component, but much of the second component's variance would be shared with the first--the stronger the correlation, the more the shared variance. The more shared variance, the more variance remaining to be explained by other components. If your aim is dimension reduction, orthogonal components get the job done most efficiently. But yes, components can also be rotated into oblique solutions, just as common factor solutions can be rotated.
Try visualizing. Imagine a three dimensional data cloud, with three variables correlated so that you have an oblong (not spherical) shape. There is a longest axis--that is your first principal component. Of course, if you want the largest variance for the second axis, you lay it almost on top of the first axis, but then the second component does not much reduce the unexplained variance remaining in the cloud.
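To make the picture concrete, here is a minimal NumPy sketch. The data cloud, the mixing matrix `A`, and the 10-degree tilt are made-up choices for illustration only. It compares the orthogonal second component with a direction laid almost on top of the first one: the tilted direction has more total variance, but nearly all of it is shared with the first component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical oblong 3-D data cloud: three correlated variables.
A = np.array([[3.0, 0.0, 0.0],
              [1.5, 1.0, 0.0],
              [1.0, 0.5, 0.5]])
X = rng.standard_normal((5000, 3)) @ A.T
X -= X.mean(axis=0)

# Principal directions = eigenvectors of the covariance matrix,
# sorted by decreasing eigenvalue (eigh returns ascending order).
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
v1, v2 = eigvecs[:, 0], eigvecs[:, 1]

# An oblique candidate for the "second axis": the first axis tilted
# by 10 degrees toward the second one (arbitrary illustrative angle).
theta = np.deg2rad(10)
u = np.cos(theta) * v1 + np.sin(theta) * v2

s1, s2, su = X @ v1, X @ v2, X @ u

var_pc2 = s2.var(ddof=1)                    # variance of the orthogonal 2nd PC
var_oblique = su.var(ddof=1)                # variance of the tilted direction
corr_oblique = np.corrcoef(s1, su)[0, 1]    # its correlation with PC1 scores

# Unique variance of the tilted direction: what is left after
# regressing its scores on the PC1 scores.
beta = (s1 @ su) / (s1 @ s1)
unique_oblique = (su - beta * s1).var(ddof=1)

print(f"orthogonal PC2: variance {var_pc2:.3f}, corr with PC1 {np.corrcoef(s1, s2)[0, 1]:.3f}")
print(f"tilted axis:    variance {var_oblique:.3f}, corr with PC1 {corr_oblique:.3f}")
print(f"tilted axis unique variance: {unique_oblique:.3f}")
```

In this setup the tilted axis always beats PC2 on raw variance, yet its unique variance is only $\sin^2\theta$ times the second eigenvalue, so it removes far less of the variance left over after the first component.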

Whenever we choose a component in PCA, that component captures all of the variance of the data along its direction.
If the 2nd component is not orthogonal to the 1st one, then the 2nd component will have some dependence on the 1st one. This would mean that the variance of the data along the 1st component has not been captured completely by the 1st component, which is a contradiction.
So each component needs to be orthogonal to the other components.
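One way to check this numerically is to remove the first component from the data and look for the direction of largest variance in what remains. The sketch below assumes a toy 3-D data set, and `top_direction` is a hypothetical helper name, not a library function. It illustrates the sequential idea rather than how any particular PCA routine is implemented.

```python
import numpy as np

def top_direction(X):
    """Unit vector w maximizing the variance of X @ w
    (leading eigenvector of the covariance matrix of X)."""
    S = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    return vecs[:, -1]

rng = np.random.default_rng(1)

# Hypothetical correlated 3-D data, centered.
mix = np.array([[2.0, 0.0, 0.0],
                [1.0, 1.0, 0.0],
                [0.5, 0.3, 0.4]])
X = rng.standard_normal((2000, 3)) @ mix.T
X -= X.mean(axis=0)

# Step 1: first component = direction of largest variance.
w1 = top_direction(X)

# Step 2: subtract everything the data has along w1 (deflation).
X_deflated = X - np.outer(X @ w1, w1)

# Step 3: direction of largest variance in what is left.
w2 = top_direction(X_deflated)

dot12 = w1 @ w2                                    # ~0: w2 is orthogonal to w1
var_along_w1_after = (X_deflated @ w1).var(ddof=1) # ~0: nothing left along w1

# Compare with the second eigenvector of the original covariance.
_, vecs_full = np.linalg.eigh(np.cov(X, rowvar=False))
match = abs(w2 @ vecs_full[:, -2])                 # ~1: same direction up to sign

print(dot12, var_along_w1_after, match)
```

Because the deflated data has no variance left along `w1`, the next maximizer necessarily lies in the subspace orthogonal to it, and it coincides (up to sign) with the usual second principal direction.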

Actually, orthogonality is not needed at all to define the component scores. You can choose them to be orthogonal if you like, but it's not necessary. Neither is unit length or variance maximization, for that matter. You can define the first two scores as linear combinations that maximize the total $R^2$ when performing multiple regression with those scores as predictors of each of the original variables. See here: https://www.tandfonline.com/doi/abs/10.1080/00273171.2017.1340824
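A small numerical check of the weaker point that the total $R^2$ does not depend on the scores being orthogonal. This is only a sketch under my own assumptions (simulated standardized data, a hypothetical helper `total_r2`, an arbitrary mixing matrix `M`) and not the procedure from the linked paper. It takes the first two ordinary PC scores, remixes them into a correlated pair, and compares the total $R^2$ from regressing each original variable on each pair.

```python
import numpy as np

def total_r2(X, scores):
    """Sum, over the columns of centered X, of the R^2 obtained by
    regressing that column on the columns of `scores` (no intercept
    is needed because everything is centered)."""
    coef, *_ = np.linalg.lstsq(scores, X, rcond=None)
    resid = X - scores @ coef
    return float(np.sum(1.0 - (resid ** 2).sum(axis=0) / (X ** 2).sum(axis=0)))

rng = np.random.default_rng(2)

# Hypothetical data: 4 correlated variables, standardized.
X = rng.standard_normal((1000, 4)) @ rng.standard_normal((4, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Ordinary (orthogonal) first two PC scores.
vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
orth_scores = X @ vecs[:, [-1, -2]]

# Oblique scores: an invertible remix of the same two scores
# (M is an arbitrary illustrative choice).
M = np.array([[1.0, 0.8],
              [0.0, 1.0]])
oblique_scores = orth_scores @ M

r2_orth = total_r2(X, orth_scores)
r2_oblique = total_r2(X, oblique_scores)
corr_oblique = np.corrcoef(oblique_scores, rowvar=False)[0, 1]

print(f"total R^2, orthogonal scores: {r2_orth:.6f}")
print(f"total R^2, oblique scores:    {r2_oblique:.6f}")
print(f"correlation between the oblique scores: {corr_oblique:.3f}")
```

The two totals agree because both pairs of scores span the same two-dimensional subspace, and regression fit depends only on that subspace. The sketch shows this invariance only; it does not by itself demonstrate that this subspace is the maximizer.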
