
I'm quite familiar with Principal Component Analysis (PCA), as I use it to study genetic structure. Lately I was revisiting some of the functions I use in R (pcoa() from the ape package and prcomp()) and I realized they don't give the same results for the explained variance, and I'm not sure which one to believe.

My distance matrix is already centered and may sometimes contain negative eigenvalues. I'm aware that pcoa() uses eigenvalue decomposition while prcomp() uses singular value decomposition, so I expect the results to differ slightly. The scales of the axes I obtain from each analysis are not the same, but why is the explained variance almost double with prcomp() (in the full dataset)?

I did some digging and apparently pcoa() applies a transformation to the distance matrix $M$ of dimensions $m \times m$:

$D = O(-0.5\,M^2)\,O$

where $M^2$ is the element-wise square of the distances and

$O = I - \frac{1}{m}\mathbf{1}\mathbf{1}^\top$

$I$ being the $m \times m$ identity matrix and $\mathbf{1}$ a column vector of $m$ ones. The rest of the PCoA is performed on this $D$ matrix in a similar fashion to prcomp() (but using eigenvalue decomposition instead of singular value decomposition). Might this be the cause? Part of the transformation centers the data, but why the $-0.5M^2$? I can't seem to find the reason for this anywhere.
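To make the notation concrete, here is a minimal sketch of that double-centering on a small random matrix (not my real data); the eigenvalues of the resulting $D$ are the ones pcoa() reports:

# Toy illustration of the double-centering; X is random data, not the real dataset
set.seed(1)
X <- matrix(rnorm(20), 5, 4)               # 5 points in 4 dimensions
M <- as.matrix(dist(X))                    # m x m Euclidean distance matrix
m <- nrow(M)

O <- diag(m) - matrix(1/m, m, m)           # centering matrix: I - (1/m) * all-ones matrix
D <- O %*% (-0.5 * M^2) %*% O              # double-centered -0.5 * squared distances

round(eigen(D)$values, 6)                  # eigenvalues of D ...
round(ape::pcoa(M)$values$Eigenvalues, 6)  # ... should match the leading values pcoa() reports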

Example dataset

With a reduced dataset (image 1) the explained variances for PC1 and PC2 are 0.44 and 0.30 with prcomp(), versus 0.36 and 0.27 with pcoa(). I understand that a small difference is normal, since the two methods differ slightly, but with the full dataset (image 2) the PCs are almost identical (just divided by 10), yet they explain 0.32 and 0.22 in prcomp() and only 0.14 and 0.11 in pcoa()!

Sorry, but I can't figure out how to produce a better reduced dataset...

Kdist<-matrix(c(0.06,0.73,0.76,1.28,1.25,1.27,1.38,1.34,1.43,1.35,
0.73,0.01,0.76,1.31,1.31,1.28,1.38,1.35,1.40,1.34,
0.76,0.76,0.06,1.27,1.31,1.29,1.36,1.34,1.39,1.32,
1.28,1.31,1.27,0.17,0.67,0.89,1.36,1.31,1.28,1.31,
1.25,1.31,1.31,0.67,0.00,0.94,1.35,1.32,1.38,1.34,
1.27,1.28,1.29,0.89,0.94,0.07,1.29,1.30,1.24,1.30,
1.38,1.38,1.36,1.36,1.35,1.29,0.11,0.96,0.81,0.87,
1.34,1.35,1.34,1.31,1.32,1.30,0.96,0.09,0.88,0.96,
1.43,1.40,1.39,1.28,1.38,1.24,0.81,0.88,0.13,0.93,
1.35,1.34,1.32,1.31,1.34,1.30,0.87,0.96,0.93,0.15),10,10)

pc<-ape::pcoa(Kdist)
prc<-prcomp(Kdist)

vec.prc <- -prc$rotation[ ,1:2]
var.prc <- round(prc$sdev^2/sum(prc$sdev^2),2)
vec.pcoa <- pc$vectors[ ,1:2]
var.pcoa <- round(pc$values$Relative_eig[1:2],2)

par(mfrow=c(1,2))
plot(vec.prc, main="prcomp",pch=19,cex=2,
     xlab=var.prc[1], ylab=var.prc[2])
plot(vec.pcoa, main="ape::pcoa",pch=19,cex=2,
     xlab=var.pcoa[1], ylab=var.pcoa[2])

[Image 1: small dataset example] [Image 2: big dataset example]

  • See here https://stats.stackexchange.com/questions/14002/whats-the-difference-between-principal-component-analysis-and-multidimensional. Some of the answers (e.g. mine) contain and discuss the formula that you are asking about. – amoeba May 10 '19 at 19:24
  • I gave the link above, but I do NOT think it's a duplicate. I vote to reopen. – amoeba May 12 '19 at 20:30
  • @amoeba I understand the relationship between MDS and PC(o)A; my question is more about the difference between using singular value decomposition or eigenvalue decomposition to calculate the principal components and, most importantly, the explained variance. To put it simply: why, if I get the same shape from the PCs of both methods, do I get such different estimates of explained variance? – Athe May 13 '19 at 09:41
  • "The difference between using singular value decomposition or eigenvalue decomposition" has nothing to do with it! This is a red herring in your question (that's actually why I edited the title to remove it). PCA can be computed using SVD or using eig(); this does not change the result, or the explained variances (see the sketch after these comments). Anyway, the question is very good. I think there are several things going on here; note that `prcomp(Kdist)` takes `Kdist` as the data matrix, i.e. rows as points and columns as features (which is wrong!), and computes the covariance matrix of this data matrix. – amoeba May 13 '19 at 12:18
  • I did not think it through and can't exactly explain why this leads to almost -- but not exactly! -- the same PCs but such different variances. I'll put a bounty now and maybe somebody will give a good answer. If not, and if you are still interested, feel free to ping me again later. – amoeba May 13 '19 at 12:21
  • @amoeba I've been doing some more digging. The value I am using for "explained variance" in `ape::pcoa` is a "Relative eigenvalue", i.e. the eigenvalue of the PC divided by the total sum. Given that there are *negative* eigenvalues, I wonder whether this is an adequate computation of the explained variance. However, for the life of me I cannot find anywhere how to compute the percentage of explained variance when there are negative eigenvalues (am I missing something??) – Athe May 16 '19 at 09:03
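A minimal sketch of amoeba's point above, reusing Kdist from the question's code: PCA computed via SVD (prcomp()) and via an explicit eigendecomposition of the covariance matrix gives identical explained variances, so the SVD/eigen distinction cannot explain the discrepancy.

# prcomp(Kdist) treats the 10 x 10 distance matrix as 10 observations of 10 variables,
# so it decomposes cov(Kdist) -- not the double-centered matrix that pcoa() uses
p.svd <- prcomp(Kdist)                       # SVD-based PCA of Kdist-as-data
ev    <- eigen(cov(Kdist))$values            # eigendecomposition of the same covariance matrix

round(p.svd$sdev^2 / sum(p.svd$sdev^2), 4)   # explained variance via SVD
round(ev / sum(ev), 4)                       # same values via eigen()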

1 Answer


This is my first time answering a question on this forum! I hope it is helpful. If I have done anything wrong in the way I have provided the answer, let me know so I can correct it.

pcoa() takes the distance matrix as its input; it will accept either a matrix or a dist object.

prcomp() takes the raw dataset.

You were giving the same matrix to both functions.

Also, the previous code plotted the loadings from prcomp() (rotation) instead of the scores (x). I changed the matrix from 10x10 to 20x5, which makes those kinds of (easy to make) mistakes obvious.

Try this:

K<-matrix(c(0.06,0.73,0.76,1.28,1.25,1.27,1.38,1.34,1.43,1.35,
        0.73,0.01,0.76,1.31,1.31,1.28,1.38,1.35,1.40,1.34,
        0.76,0.76,0.06,1.27,1.31,1.29,1.36,1.34,1.39,1.32,
        1.28,1.31,1.27,0.17,0.67,0.89,1.36,1.31,1.28,1.31,
        1.25,1.31,1.31,0.67,0.00,0.94,1.35,1.32,1.38,1.34,
        1.27,1.28,1.29,0.89,0.94,0.07,1.29,1.30,1.24,1.30,
        1.38,1.38,1.36,1.36,1.35,1.29,0.11,0.96,0.81,0.87,
        1.34,1.35,1.34,1.31,1.32,1.30,0.96,0.09,0.88,0.96,
        1.43,1.40,1.39,1.28,1.38,1.24,0.81,0.88,0.13,0.93,
        1.35,1.34,1.32,1.31,1.34,1.30,0.87,0.96,0.93,0.15),
      20,5) # changed dimensions of matrix so 20 rows and 5 columns

# added 'dist()'
Kdist <- dist(K) 
pc<-ape::pcoa(Kdist)
prc<-prcomp(K)

# percent variance explained matches now
pc$values[,2]                  # Relative_eig column from pcoa()
prc$sdev^2/sum(prc$sdev^2)     # proportion of variance from prcomp()

# Images are inverted, but same pattern
# Axes on plots now match
vec.prc <- prc$x[ ,1:2] # changed from 'rotation' to 'x'
var.prc <- round(prc$sdev^2/sum(prc$sdev^2),2)
vec.pcoa <- pc$vectors[ ,1:2]
var.pcoa <- round(pc$values$Relative_eig[1:2],2)

par(mfrow=c(1,2))
plot(vec.prc, main="prcomp",pch=19,cex=2,
     xlab=var.prc[1], ylab=var.prc[2])
plot(vec.pcoa, main="ape::pcoa",pch=19,cex=2,
     xlab=var.pcoa[1], ylab=var.pcoa[2])
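As an optional check (reusing K, pc and prc from above): for Euclidean distances, the PCoA eigenvalues should be (n - 1) times the prcomp() variances, because classical scaling of dist(X) recovers the PCA of X up to that factor.

n <- nrow(K)
round(pc$values$Eigenvalues[1:5], 4)   # leading PCoA eigenvalues
round((n - 1) * prc$sdev^2, 4)         # (n - 1) * prcomp() variances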