
Apologies if this has been asked before; nothing turned up when I searched.

I'm noticing some very interesting behavior when I run PCA on pairs of dummy datasets I just invented, each of which is a permutation of a fixed set (here just the integers from 1 to 10). In R:

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(10, 2, 1, 5, 4, 3, 9, 8, 7, 6)
z <- c(8, 3, 2, 1, 4, 7, 9, 6, 5, 10)

df1 <- data.frame(x, y)
df2 <- data.frame(x, z)
df3 <- data.frame(y, z)

I then use prcomp:

> prcomp(df1)
Standard deviations (1, .., p=2):
[1] 3.415650 2.581989

Rotation (n x k) = (2 x 2):
        PC1        PC2
x 0.7071068 -0.7071068
y 0.7071068  0.7071068
> prcomp(df2)
Standard deviations (1, .., p=2):
[1] 3.681787 2.185813

Rotation (n x k) = (2 x 2):
        PC1        PC2
x 0.7071068 -0.7071068
z 0.7071068  0.7071068
> prcomp(df3)
Standard deviations (1, .., p=2):
[1] 3.858612 1.855921

Rotation (n x k) = (2 x 2):
         PC1        PC2
y -0.7071068 -0.7071068
z -0.7071068  0.7071068

So every loading in every principal component is either $\frac{\sqrt{2}}{2}$ or $-\frac{\sqrt{2}}{2}$. I'm not sure exactly why this would be, although it makes a certain kind of sense: both variables 'contain the same data', and if we didn't see this behavior, we would be 'preferring' one variable over the other.

That's a very high-level and hand-wavy view of things, though. Also, if I try more than two variables at a time, this behavior disappears:

> prcomp(data.frame(x, y, z))
Standard deviations (1, .., p=3):
[1] 4.208109 2.603247 1.736355

Rotation (n x k) = (3 x 3):
        PC1        PC2        PC3
x 0.5003708  0.8107791 -0.3037538
y 0.5781928 -0.5740476 -0.5797951
z 0.6444549 -0.1144843  0.7560233

Can someone give me some insight into what's going on here?

Oddsee
    This has little to do with permutations and everything to do with the fact that (1) your data are standardized to a common variance and (2) you are performing PCA in just two dimensions. See the extended explanation in my answer at https://stats.stackexchange.com/a/71303/919, especially the matrix $\mathbb Q$ at the very end. For comparison, generate perfectly random data for `x` and `y` and run `prcomp(df1, scale=TRUE)`: you will get the same result. – whuber Sep 17 '19 at 12:12
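
The comment's suggestion can be tried directly. What follows is a minimal sketch, not the commenter's own code: the seed, the sample size, and the names `rand_df`, `abs_loadings`, `same_variance`, and `perm_loadings` are assumptions chosen for illustration. It runs a scaled PCA on two unrelated random columns, then checks that the question's permuted columns already share a single variance.

```r
# Two independent random columns: no permutation structure at all.
set.seed(1)  # arbitrary seed (an assumption), used only for reproducibility
rand_df <- data.frame(x = rnorm(10), y = rnorm(10))

# scale. = TRUE standardizes each column to unit variance before the PCA.
# Take absolute values because the sign of each component is arbitrary.
abs_loadings <- abs(unname(prcomp(rand_df, scale. = TRUE)$rotation))

# The permuted columns from the question contain the same values,
# so they have the same variance even without scaling.
x <- 1:10
y <- c(10, 2, 1, 5, 4, 3, 9, 8, 7, 6)
same_variance <- isTRUE(all.equal(var(x), var(y)))
perm_loadings <- abs(unname(prcomp(data.frame(x, y))$rotation))

print(abs_loadings)
print(same_variance)
print(perm_loadings)
```

Both loading matrices should have every entry equal to $\frac{\sqrt{2}}{2}$ in absolute value. With two variables of equal variance, the covariance matrix has equal diagonal entries, and as long as the off-diagonal entry is nonzero its eigenvectors are proportional to $(1, 1)$ and $(1, -1)$ whatever the off-diagonal value is.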
