I recognize that there are many questions on Cross Validated about scaling and PCA, but, after reading all of them, I still can't find the answer to my question. Several people have said that the question "PCA on correlation or covariance?" is a duplicate. However, it addresses why BOTH centering and scaling are needed, but it does not address each operation individually. Thus, my question is unique.
Why is scaling the data (often done by dividing by the standard deviation) needed before PCA?
The reason I most often see is that variables measured in different units need to be put on a common scale. However, it seems that CENTERING the data (subtracting from each data vector its own mean) sufficiently addresses this issue.
Suppose that
- X1 is in kilometres
- X2 is in metres
- U1 = X1 - mean(X1)
- U2 = X2 - mean(X2)
then the effect of using different units of measurement is removed, but raw deviations are retained.
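To make the setup concrete, here is a minimal sketch in Python. The numbers and the names `x1_km`, `x2_m`, and `center` are made up for illustration and do not come from any real data set. It centers each variable as described above and prints the variance of the centered values so the two can be compared.

```python
from statistics import mean, pvariance

# Hypothetical measurements, invented purely for illustration.
x1_km = [1.0, 2.0, 3.0, 4.0, 5.0]            # X1, in kilometres
x2_m = [100.0, 300.0, 200.0, 500.0, 400.0]   # X2, in metres


def center(values):
    """Subtract the mean of the list from every value in it."""
    m = mean(values)
    return [v - m for v in values]


u1 = center(x1_km)  # U1 = X1 - mean(X1)
u2 = center(x2_m)   # U2 = X2 - mean(X2)

# Population variance of each centered variable.
var_u1 = pvariance(u1)
var_u2 = pvariance(u2)

print("U1:", u1, "variance:", var_u1)
print("U2:", u2, "variance:", var_u2)
```

The centered values have mean zero, and the printed variances are the quantities that a PCA on the centered (but unscaled) data would see for each variable.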
In fact, it seems to me that scaling by the standard deviation actually removes useful information. PCA seeks the directions that capture most of the VARIATION in the data set. If you divide each data vector by its standard deviation, then the standard deviation of each data vector is 1, and you have just lost all of the variation that you sought to capture in the first place.
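Here is a small sketch of what I mean by the standard deviation becoming 1, again with made-up numbers (the data and the names `standardize`, `z1`, `z2`, and `cov_z` are hypothetical). It centers each variable, divides by its standard deviation, and then prints the standard deviation of each result along with the covariance between the two standardized variables.

```python
from statistics import mean, pstdev

# Hypothetical measurements, invented purely for illustration.
x1_km = [1.0, 2.0, 3.0, 4.0, 5.0]            # X1, in kilometres
x2_m = [100.0, 300.0, 200.0, 500.0, 400.0]   # X2, in metres


def standardize(values):
    """Center the values, then divide by their population standard deviation."""
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


z1 = standardize(x1_km)
z2 = standardize(x2_m)

# After scaling, every variable has standard deviation 1.
sd_z1 = pstdev(z1)
sd_z2 = pstdev(z2)

# Covariance of the standardized variables (both already have mean zero).
# This is the same number as the correlation between X1 and X2.
cov_z = mean([a * b for a, b in zip(z1, z2)])

print("sd of Z1:", sd_z1)
print("sd of Z2:", sd_z2)
print("cov(Z1, Z2):", cov_z)
```

Both standard deviations come out as 1 (up to rounding), while the covariance between the standardized variables is still whatever the correlation between the original variables was.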
To those who think that scaling IS necessary before PCA, please tell me why I am wrong.
Thank you.