I am having trouble finding a proper explanation of the variability of the components of eigenvectors when applying PCA to a purely random dataset. As background to my question: I want to make sure the PCA I perform on financial returns is sound, and that I can be reasonably confident the "economic" interpretation of the eigenvectors is not based on pure chance.
Here is some R code to generate 4 uncorrelated realizations of normally distributed random variables; all variables have the same mean and volatility (0% mean and 20% annualized volatility).
k <- 4
n <- 1000
annualized_vol <- .2

pca_elements <- list()
for (i in 1:1000) {
  # k uncorrelated daily return series (assuming 360 days per year)
  r <- replicate(k, rnorm(n, 0, annualized_vol / sqrt(360)))
  pca <- prcomp(r, scale. = TRUE)
  pca_elements[[i]] <- list(
    eigenvalues = pca$sdev^2,
    eigenvectors = pca$rotation
  )
}
When I examine the variability of the eigenvalues, I get the result I expect:
library(magrittr)  # provides the %>% pipe

lapply(pca_elements, `[[`, "eigenvalues") %>%
  do.call(rbind, .) %>%
  summary()
with all eigenvalues close to 1 (because I scaled the data before performing the PCA), i.e. no leading factor, and each component explaining roughly the same share of the variance. All of that is true by construction, because I built the data that way.
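To put a number on "close to 1", the cross-simulation spread of each eigenvalue can be computed directly from the objects above (a minimal sketch reusing pca_elements):

eigvals <- do.call(rbind, lapply(pca_elements, `[[`, "eigenvalues"))
apply(eigvals, 2, sd)   # standard deviation of each eigenvalue across the 1000 runs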
But when I try to examine what is going on with, say, the first eigenvector, its loadings take essentially arbitrary values across simulations, with no clear pattern. My intuition is that my 4-dimensional dataset is a noisy ball centered at 0, so any orthonormal set of 4 vectors would do equally well as a new basis for this dataset.
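One detail worth controlling for when summarizing the loadings: an eigenvector is only defined up to sign, so v and -v are equally valid outputs of prcomp, and mixing the two inflates the apparent variability. A minimal sketch that fixes the sign (my own convention: force the first loading to be positive) before summarizing:

pc1 <- sapply(pca_elements, function(x) {
  v <- x$eigenvectors[, 1]
  if (v[1] < 0) -v else v   # resolve the sign indeterminacy
})
summary(t(pc1))   # distribution of PC1 loadings across the 1000 runs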
Am I correct? How can I be confident that a real dataset would not reproduce this feature? Would doing some cross-validation on my real returns help?
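For what it is worth, the kind of stability check I have in mind on the real data is a split-half comparison like the sketch below; returns_matrix is a placeholder name for my matrix of real daily returns, and both the 50/50 split and the use of absolute cosine similarity are my own choices, not an established recipe.

set.seed(1)
half <- sample(nrow(returns_matrix), floor(nrow(returns_matrix) / 2))
pca_a <- prcomp(returns_matrix[half, ], scale. = TRUE)
pca_b <- prcomp(returns_matrix[-half, ], scale. = TRUE)
# columns of $rotation have unit norm, so this dot product is the cosine
# between the two first eigenvectors; values near 1 suggest a stable
# direction, values comparable to the random benchmark suggest noise
abs(sum(pca_a$rotation[, 1] * pca_b$rotation[, 1]))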