
I have a $1600\times 5000$ data matrix $X$ containing 1600 data points in 5000-dimensional space. Using MATLAB's built-in pca function, I get the loadings in coeff.

In theory, coeff*coeff' should give us an almost-identity matrix. For example:

coeff = pca(rand(1000,1000));
coeff*coeff';

However, in my case, coeff*coeff' is far from the identity, with some of the diagonal entries as low as 0.01. As a result, if I wish to reconstruct my data points, even with all the PCs, I worry that the results may be lousy.

What is the possible explanation for this? And is there a way I can get around this problem?

amoeba
Sibbs Gambling
  • If you run `coeff=pca(rand(1600,5000))`, you will see that `coeff` is of 5000x1599 size. Meaning that with 1600 points you can only find 1599 principal components in 5000-dimensional space. Of course `coeff*coeff'` will then not be an identity 5000x5000 matrix, because it will be low rank and have 5000-1599 zero eigenvalues. The reconstruction of your data points should still be perfect though (and not "lousy"), because all your points lie in this 1599-dimensional subspace. Does this make sense? – amoeba Apr 23 '15 at 17:09
  • @amoeba I see, that makes perfect sense! So in the case where we have more data points than dimensions, we would have an identity matrix, right? – Sibbs Gambling Apr 23 '15 at 17:20
  • Yes. I can post this as an answer, but perhaps you should clarify whether you were simply afraid that the reconstruction results would be lousy, or whether you did it and they were lousy. In the latter case, you did something wrong. – amoeba Apr 23 '15 at 17:24

1 Answer


If you run

coeff = pca(rand(1600,5000))

you will see that coeff is of size $5000\times 1599$. This means that with $1600$ points you can only find $1599$ principal components in the $5000$-dimensional space. For the reason why, see here: Why are there only $n-1$ principal components for $n$ data points if the number of dimensions is larger or equal than $n$?

Of course the $5000\times 5000$ matrix coeff*coeff' cannot be an identity matrix, because it is of low rank (rank $1599$) and must have $5000-1599=3401$ zero eigenvalues (whereas the identity matrix has all eigenvalues equal to one and is full rank).
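This is easy to check on a smaller problem. The sketch below uses $50$ points in $200$ dimensions; the sizes and the seed are arbitrary choices of mine, made only so that it runs quickly, and it assumes the toolbox that provides pca is installed. Because the columns of coeff are orthonormal, coeff*coeff' is an orthogonal projector: its diagonal entries lie between $0$ and $1$ and its trace equals its rank, so small diagonal values are expected rather than alarming.

```matlab
rng(0);                         % arbitrary seed, only for reproducibility
n = 50;  p = 200;               % fewer points than dimensions, as in the question
X = rand(n, p);

coeff = pca(X);                 % p x (n-1) when n <= p
P = coeff * coeff';             % p x p, but not full rank

disp(size(coeff));              % expected: 200 49
disp(rank(P));                  % expected: 49, i.e. n-1
disp(trace(P));                 % approximately 49: one per unit eigenvalue
disp(mean(diag(P)));            % approximately (n-1)/p = 0.245
```

By the same reasoning, the average diagonal entry in the question's setting would be about $1599/5000 \approx 0.32$, with individual entries scattered around that value.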

However, if you use all PCs for reconstruction, then the reconstruction of your data points should still be perfect (and not "lousy"), because all your data points, once centered, lie precisely in the $1599$-dimensional subspace found by PCA. Since pca centers the data by default, remember to add the mean back when reconstructing.

By the way, note that coeff'*coeff will be a $1599\times 1599$ identity matrix.
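Both the orthonormality and the exact reconstruction can be verified numerically. Here is a minimal sketch on a small random matrix (sizes and seed are again arbitrary). It assumes pca's default centering, which is why the mean mu returned by pca is added back, and it uses bsxfun rather than implicit expansion so that it should also run on older MATLAB releases.

```matlab
rng(0);                                   % arbitrary seed, only for reproducibility
X = rand(50, 200);                        % 50 points in 200 dimensions
[coeff, score, ~, ~, ~, mu] = pca(X);     % mu is the 1 x 200 column mean

% coeff'*coeff should be the 49 x 49 identity up to rounding error
G = coeff' * coeff;
errG = max(max(abs(G - eye(size(G, 1)))));

% Reconstruct from all PCs: project back, then undo the centering
Xrec = bsxfun(@plus, score * coeff', mu);
errRec = max(abs(X(:) - Xrec(:)));

fprintf('max |coeff''*coeff - I|   = %.2e\n', errG);
fprintf('max reconstruction error = %.2e\n', errRec);
```

Both numbers should come out at the level of floating-point rounding error. If the reconstruction error is large instead, the usual culprit is forgetting to add mu back.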

amoeba