I am having trouble finding a proper explanation of the variability of the components of eigenvectors when applying PCA to a purely random dataset. As background to my question: I want to make sure the PCA I perform on financial returns is sound, and that I can be reasonably confident the "economic" interpretation of the eigenvectors is not based on pure chance.
Here is some R code to generate 4 uncorrelated realizations of normally distributed random variables; all variables have the same mean and volatility (0% mean and 20% annualized volatility).
k <- 4
n <- 1000
annualized_vol <- .2

pca_elements <- list()
for (i in 1:1000) {
  # k uncorrelated daily return series (assuming 360 days per year)
  r <- replicate(k, rnorm(n, 0, annualized_vol / sqrt(360)))
  pca <- prcomp(r, scale. = TRUE)
  pca_elements[[i]] <- list(
    eigenvalues = pca$sdev^2,
    eigenvectors = pca$rotation
  )
}
When I examine the variability of the eigenvalues, I get the result I expect:
library(magrittr)  # provides the %>% pipe

lapply(pca_elements, `[[`, "eigenvalues") %>%
  do.call(rbind, .) %>%
  summary()
with all eigenvalues close to 1 (because I scaled the data before performing the PCA), i.e. no leading factor, and each component explaining roughly the same share of the variance. All of that is true by construction, because I built the data that way.
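To put a number on "close to 1", the cross-simulation spread of each eigenvalue can be computed directly from the objects above (a minimal sketch reusing pca_elements):

eigvals <- do.call(rbind, lapply(pca_elements, `[[`, "eigenvalues"))
apply(eigvals, 2, sd)   # standard deviation of each eigenvalue across the 1000 runs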
But when I try to examine what is going on with, say, the first eigenvector, its loadings take essentially arbitrary values across simulations, with no clear pattern. My intuition is that my 4-dimensional dataset is a noisy ball centered at 0, so any orthonormal set of 4 vectors would do equally well as a new basis for this dataset.
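One detail worth controlling for when summarizing the loadings: an eigenvector is only defined up to sign, so v and -v are equally valid outputs of prcomp, and mixing the two inflates the apparent variability. A minimal sketch that fixes the sign (my own convention: force the first loading to be positive) before summarizing:

pc1 <- sapply(pca_elements, function(x) {
  v <- x$eigenvectors[, 1]
  if (v[1] < 0) -v else v   # resolve the sign indeterminacy
})
summary(t(pc1))   # distribution of PC1 loadings across the 1000 runs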
Am I correct? How can I be confident that a real dataset would not reproduce this feature? Would doing some cross-validation on my real returns help?
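For what it is worth, the kind of stability check I have in mind on the real data is a split-half comparison like the sketch below; returns_matrix is a placeholder name for my matrix of real daily returns, and both the 50/50 split and the use of absolute cosine similarity are my own choices, not an established recipe.

set.seed(1)
half <- sample(nrow(returns_matrix), floor(nrow(returns_matrix) / 2))
pca_a <- prcomp(returns_matrix[half, ], scale. = TRUE)
pca_b <- prcomp(returns_matrix[-half, ], scale. = TRUE)
# columns of $rotation have unit norm, so this dot product is the cosine
# between the two first eigenvectors; values near 1 suggest a stable
# direction, values comparable to the random benchmark suggest noise
abs(sum(pca_a$rotation[, 1] * pca_b$rotation[, 1]))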