I have two questions about PCA:
- Is a higher weight assigned to variables that do not necessarily have higher variance themselves but are more representative of the variation in the other variables? (i.e., does PCA aim to capture the common variation?)
- Since the input to PCA is just the correlation matrix, which is not affected by scale, is standardization required before PCA?
Below is a toy dataset I created to help explain my doubts. Please correct me if anything is wrong. Thanks so much!
```r
set.seed(99)
a = sample(1:1000, 300, replace = TRUE)
b = sample(1:1000, 300)

# Standardize a and b
a_t = (a - mean(a)) / sd(a)
b_t = (b - mean(b)) / sd(b)

# The standard deviations of the two variables differ
sd(a)
sd(b)

# Correlation matrix of the raw variables
x = as.matrix(cbind(a, b))
xcor = cor(x)
xcor

# Correlation matrix of the standardized variables
x_t = as.matrix(cbind(a_t, b_t))
xcor_t = cor(x_t)
xcor_t
# The correlation matrix is not affected by the standardization

# Eigen decomposition (raw variables)
out = eigen(xcor)
va = out$values
ve = out$vectors
ve
# The two variables are assigned the same weight (in absolute value)

# Eigen decomposition (standardized variables)
out_t = eigen(xcor_t)
va_t = out_t$values
ve_t = out_t$vectors
ve_t
# The standardized variables share the same eigenvectors as the unstandardized variables
```
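
To probe the second question a bit further, here is a minimal sketch that contrasts PCA on the correlation matrix with PCA on the covariance matrix. It rests on my own assumption that this is the relevant comparison. The factor of 10 applied to `b` and the names `x2`, `e_cor`, `e_cov`, `p_raw`, and `p_std` are hypothetical, introduced only to exaggerate the scale difference. As far as I understand, `prcomp()` works on the covariance matrix unless `scale. = TRUE` is passed.

```r
set.seed(99)
a <- sample(1:1000, 300, replace = TRUE)
b <- sample(1:1000, 300)

# Hypothetical rescaling: put b on a scale 10 times larger than a
x2 <- cbind(a = a, b = 10 * b)

# Correlation-based PCA: unchanged by standardizing the columns
e_cor   <- eigen(cor(x2))
e_cor_t <- eigen(cor(scale(x2)))

# Covariance-based PCA: depends on the scale of each column
e_cov <- eigen(cov(x2))

# The covariance of the standardized data equals the correlation of the raw data
cov_std <- cov(scale(x2))

# prcomp() on raw data versus prcomp() with standardization
p_raw <- prcomp(x2)
p_std <- prcomp(x2, scale. = TRUE)

abs(e_cor$vectors)   # equal absolute weights for both variables
abs(e_cov$vectors)   # first component dominated by the rescaled b
abs(p_std$rotation)  # matches the correlation-based eigenvectors
```

If this sketch is right, standardization makes no difference once the correlation matrix is the input, but it does matter when the decomposition is applied to the covariance matrix of the raw data.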