
I was trying to compute principal components for the following simulated data:

test = data.frame(x = rnorm(3), x2 = rnorm(3), x3 = rnorm(3), x4 = rnorm(3), x5 = rnorm(3))
pr.out = prcomp(test, scale = TRUE)
pr.out$rotation
        PC1         PC2         PC3
x   0.4976161  0.23472585 -0.62942510
x2  0.2612125  0.77520587  0.03019296
x3 -0.5150470 -0.03722759 -0.46152143
x4 -0.4137207  0.53641384  0.40663018
x5  0.4977027 -0.23416645  0.47388004
pr.out$sdev
1.939906e+00 1.112100e+00 3.716899e-16

This only gives 3 principal components. I had expected that, since the variance-covariance matrix of the data is 5x5, the matrix would have 5 eigenvectors, 3 of which correspond to zero eigenvalues, because the number of eigenvectors corresponding to non-zero eigenvalues is at most $\min(n-1, p)$. Based on the standard deviations of the principal component scores, the program seems to be dropping two of the principal components that correspond to zero eigenvalues.

I went through the documentation but didn't find anything to indicate that it would output at most $n$ principal components (where $n$ is the number of observations, 3 in this case). Is there a reason why I'm seeing this?
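
As a quick check of this expectation, here is a minimal sketch using the same test data frame as above (the covariance matrix of the centered-and-scaled data should have rank at most $n-1 = 2$):

S = cov(scale(test))
qr(S)$rank       # 2: at most n - 1 non-zero eigenvalues
eigen(S)$values  # 5 eigenvalues, 3 of them practically zero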

Yandle
    Although this question is phrased differently than your previous one at https://stats.stackexchange.com/questions/472055/rank-of-sample-covariance-matrix-when-p-n, it is identical: you can't have more principal components than the rank of the matrix. – whuber Jun 15 '20 at 12:52
  • @whuber Are you referring to the rank of the covariance matrix / number of non-zero eigenvalues? If so, then I shouldn't be seeing any PCs for zero EVs, yet this function does return PCs corresponding to zero EVs. I created 4 predictors that are multicollinear with 50 points, and got 4 PCs back, with 3 capturing none of the variance / zero eigenvalues as expected (see the sketch after these comments). I think this is an issue in R unrelated to my previous question and the same as [this unanswered question](https://stats.stackexchange.com/questions/363245/pca-generating-less-principal-components-than-the-number-of-original-variables). – Yandle Jun 15 '20 at 14:51
  • The point is that the software understands that a matrix whose smaller dimension is $d$ cannot have more than $d$ nonzero eigenvalues, so why would it return more than $d$ principal components? – whuber Jun 15 '20 at 17:05
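
A minimal sketch of the experiment described in the comment above (hypothetical data: 50 observations, 4 perfectly collinear predictors):

z = rnorm(50)
test2 = data.frame(x1 = z, x2 = 2 * z, x3 = -z, x4 = 3 * z)
prcomp(test2)$sdev  # 4 values returned, the last 3 practically zero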

1 Answer


The reason seems to lie in this remark in the prcomp documentation:

The calculation is done by a singular value decomposition of the (centered and possibly scaled) data matrix, not by using ‘eigen’ on the covariance matrix.
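
This can be verified directly; a minimal sketch, assuming the test data frame from the question:

X = scale(test)           # center and scale, as prcomp(test, scale = TRUE) does
sv = svd(X)
sv$v                      # 5x3: equals pr.out$rotation up to column signs
sv$d / sqrt(nrow(X) - 1)  # equals pr.out$sdev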

With

eigen(cov(test))

five components are returned (with three eigenvalues practically zero), but

svd(test, nu = 0)

only returns min(nrow, ncol) = 3 singular values and right singular vectors. Note that the third value returned in pr.out$sdev is practically zero.
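
The difference in dimensions can be seen directly (a minimal sketch with the question's data):

dim(eigen(cov(test))$vectors)  # 5 x 5: all five eigenvectors
dim(svd(test, nu = 0)$v)       # 5 x 3: only min(nrow, ncol) right singular vectors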

cdalitz
  • So if I understand correctly, if the number of predictors ($p$) is greater than the number of observations ($n$), then prcomp and svd ignore $p-n$ of the principal components corresponding to zero eigenvalues? – Yandle Jun 15 '20 at 14:57
  • @Yandle Note that the SVD is not done on the covariance matrix (dimension 5x5), but on the original input data (dimension 3x5). As this has only three rows, the dimensions of the decomposition are restricted. The result is this cut-off, due to the particular algorithm used by prcomp. As I once evaluated in a paper, the algorithm based on the SVD is computationally much less efficient than the one based on eigen, so I am surprised that prcomp uses it instead. – cdalitz Jun 15 '20 at 15:04
  • If I understand correctly, for an $n \times p$ design matrix $X$ that is already centered, $\frac{1}{n}X^TX$ would give the covariance matrix. If I take the SVD of $X$ as $USV^T$, I think the $p$ columns of $V$ are the principal components (since the SVD of $\frac{1}{n}X^TX$ is $V\frac{1}{n}S^2V^T$)? Is prcomp only calculating the first $\min(n,p)$ columns of $V$? – Yandle Jun 15 '20 at 19:07
  • R is open source, so you can look up the implementation of *prcomp* (implemented in pure R). It does not compute any eigenvalues or eigenvectors, but the right singular vectors of the data matrix (with the R function *svd*). These are in this case *fewer* than the eigenvectors of the covariance matrix, because your data matrix has fewer rows than the dimension of the feature space. You can test this easily yourself: create a 2x3 matrix *X* and call *svd(X)* and check the dimensions of the computed matrices *U* and *V* (*S* is a diagonal matrix, so *svd* only returns its diagonal elements); a sketch is given below. – cdalitz Jun 16 '20 at 09:16
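
A minimal sketch of the test suggested in the last comment:

X = matrix(rnorm(6), nrow = 2, ncol = 3)
s = svd(X)
dim(s$u)     # 2 x 2
dim(s$v)     # 3 x 2: only min(2, 3) = 2 right singular vectors
length(s$d)  # 2 singular values (the diagonal of S)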