
I have always been confused about how to properly interpret PCA results.

My data is a big table with more than 5 million rows and 12 columns (the first few rows are all 0). Each column is an individual with more than 5 million observations (numbers).

> head(data)
       YC1CO YC1LI YC4CO YC4LI YC5CO YC5LI YM1CO YM1LI YM3CO YM3LI
f1     0     0     0     0     0     0     0     0     0     0
f2     0     0     0     0     0     0     0     0     0     0
f3     0     0     0     0     0     0     0     0     0     0
f4     0     0     0     0     0     0     0     0     0     0
f5     0     0     0     0     0     0     0     0     0     0
f6     0     0     0     0     0     0     0     0     0     0

Then I run PCA using prcomp in R:

pca<-prcomp(data,scale=T,center=T)

The outputs are:

pca$rotation:

> pca$rotation
        PC1        PC2          PC3         PC4         PC5         PC6
YC1CO 0.2888377 -0.1511474  0.354970405 -0.14922899  0.29263063 -0.42756650
YC1LI 0.2887845  0.2891378  0.006931811 -0.11867753  0.10465221  0.32239652
YC4CO 0.2888937 -0.1073097  0.376083559 -0.16145206  0.28844683 -0.19929480
YC4LI 0.2888576  0.2899107  0.032538093 -0.10721970  0.11537841  0.19513249
YC5CO 0.2885639 -0.2200563  0.393267987 -0.13833481 -0.80160742  0.20303762
YC5LI 0.2887792  0.2926729  0.010423117 -0.11994739  0.12149153  0.31232174
YM1CO 0.2889243 -0.2483682  0.100978896  0.12858598  0.04456687 -0.19313330
YM1LI 0.2891586  0.2571790 -0.112257791  0.05154060 -0.01859997 -0.02233253
YM3CO 0.2872954 -0.5242998 -0.631712144 -0.47494155  0.05150495  0.08749259
YM3LI 0.2891991  0.2441790 -0.131464167  0.06272038 -0.03712138 -0.03534204
YM5CO 0.2881663 -0.3741525 -0.033566125  0.75538412  0.17427960  0.33801997
YM5LI 0.2886363  0.2481463 -0.369342316  0.27066649 -0.33590255 -0.57941556
        PC7           PC8         PC9        PC10        PC11
YC1CO  0.658953472 -0.0032299313  0.19565297  0.02889161  0.06028938
YC1LI  0.075935158  0.7256745487  0.10496762 -0.36161999 -0.17252148
YC4CO -0.733333390  0.0315817680  0.26028561  0.06764965  0.05879949
YC4LI  0.050613636 -0.0400665068 -0.25306478  0.74445447 -0.38210226
YC5CO  0.040387219 -0.0106509021  0.07336791  0.04776528  0.03597253
YC5LI  0.040072049 -0.6856737279  0.16012889 -0.41945134 -0.17527656
YM1CO -0.085165645 -0.0130678334 -0.78767928 -0.33407254 -0.22202704
YM1LI -0.004723194 -0.0087215382 -0.16061437  0.06378812  0.58204927
YM3CO -0.003226445 -0.0019793258  0.06649693  0.05308833  0.01803736
YM3LI -0.006287346 -0.0087025271 -0.14028837  0.03396984  0.52810233
YM5CO  0.048983778  0.0009123461  0.21714944  0.10450049  0.01040703
YM5LI -0.081931860  0.0139306658  0.26532068 -0.02864126 -0.34310790
              PC12
YC1CO -0.005335094
YC1LI  0.007632148
YC4CO -0.006459107
YC4LI -0.012083181
YC5CO  0.002861339
YC5LI  0.009554891
YM1CO  0.007425773
YM1LI  0.682200634
YM3CO  0.007334849
YM3LI -0.730105933
YM5CO  0.004956218
YM5LI  0.032252049

summary(pca)

> summary(pca)
Importance of components:
                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
Standard deviation     3.4418 0.20675 0.13369 0.11872 0.11105 0.10690 0.10325
Proportion of Variance 0.9872 0.00356 0.00149 0.00117 0.00103 0.00095 0.00089
Cumulative Proportion  0.9872 0.99072 0.99221 0.99338 0.99441 0.99536 0.99625
                          PC8     PC9    PC10    PC11    PC12
Standard deviation     0.10054 0.09888 0.09789 0.09215 0.08375
Proportion of Variance 0.00084 0.00081 0.00080 0.00071 0.00058
Cumulative Proportion  0.99709 0.99791 0.99871 0.99942 1.00000

The eigenvalues are:

> pca$sdev^2
 [1] 11.845894818  0.042746822  0.017872795  0.014093498  0.012331364
 [6]  0.011428471  0.010660422  0.010107398  0.009777983  0.009582267
[11]  0.008490902  0.007013259
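
As a quick sanity check: with scale=T every column has unit variance, so the eigenvalues should sum to the number of columns (12), and they do:

sum(pca$sdev^2)   # ~12, the number of columns, as expected with scale=T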

I just took the PC1 and PC2 values from pca$rotation above and plotted them.

[Figure: scatter plot of the PC1 vs. PC2 values taken from pca$rotation]

And the biplot

> biplot(pca)

[Figure: output of biplot(pca)]

I have a few specific questions, and I would really appreciate it if you could comment to help me understand the plots.

  1. Based on the proportion of variance, I know PC1 explains almost all the variance. Do the x-axis and y-axis values matter at all here? The PC1 values are much closer together than the PC2 values, even though PC1 is the dominant PC.

  2. Can I say that the differences between the points YC1CO, YC4CO, YC5CO and YM1CO, YM3CO, YM5CO are what drive PC1?

  3. My initial goal was to show the relationships among the 12 individuals (YC1CO, etc.) and see whether they cluster or separate from each other and whether there are meaningful patterns. That's why I want to plot PC1 against PC2, to show the relative locations of the 12 individuals. Now I'm confused about how to plot this.

Thanks!

user2157668
  • The graph looks like a _loading plot_, which shows the variables in the space of the PCs; the coordinates are loadings. If that is true, check that the sum of the squared coordinates on each PC equals the variance (squared st. dev.) of that PC. – ttnphns Jun 10 '14 at 07:06
  • If the coordinates are _eigenvectors_, then their sum of squares for each PC is 1 (a quick check is sketched after these comments). [Read](http://stats.stackexchange.com/a/35653/3277) about what eigenvectors and loadings are. – ttnphns Jun 10 '14 at 07:09
  • @ttnphns, I calculated the sum of the squared coordinates of the 6 points and it's about 1, so I think the coordinates are eigenvectors. – user2157668 Jun 11 '14 at 19:58
  • @ttnphns If the coordinates are eigenvectors, are the values on the axes eigenvalues? – user2157668 Jun 12 '14 at 14:26
  • @ttnphns I think the plot is wrong... the eigenvalues are pca$sdev^2, i.e. the squares of (3.4418, 0.20675, 0.13369, ...), and I think the values on the axes are somehow not the projections of each individual onto the eigenvectors, which is what I want... they come from pca$rotation in R's prcomp, so are they loadings? But the sum of the squared coordinates is 1... these seem contradictory... – user2157668 Jun 12 '14 at 14:37
  • @ttnphns The prcomp help page (http://stat.ethz.ch/R-manual/R-patched/library/stats/html/prcomp.html) says rotation is the matrix of variable loadings, and the first plot took values from the first two columns of rotation, so I think they are indeed loadings and the sum of squares is 1. Right? – user2157668 Jun 12 '14 at 21:42
  • @user2157668: No, this does not make sense. The dimensionality of your space is 5 million, so the eigenvectors of the covariance matrix should be of that length. I guess you should transpose your data variable so that features are in columns! Alternatively, center your data yourself! Default centering by the prcomp function probably does not work, because it centers columns, not rows. This is my guess, as I do not know R. – amoeba Jun 12 '14 at 23:28
  • @amoeba Yeah, you made a good point. I've read a few more tutorials and they all say the observations/samples should be rows and the variables should be columns, so all the results/figures above are just wrong. I also read that when the number of samples is smaller than the number of variables (which is my case), PCA is technically impossible. So I shouldn't really use PCA, I guess? Maybe clustering makes more sense. – user2157668 Jun 13 '14 at 16:42
  • @user2157668: The statement that in your case PCA "is technically impossible" is wrong; it is still possible. I have no idea what the proper way of doing it in R is. However, if "pca$rotation" should contain eigenvectors when the input has variables in columns, it will most probably contain normalized principal components when the input has variables in rows (it's a mathematical trick, see [my answer here](http://stats.stackexchange.com/questions/101344/linear-pca-versus-linear-kernel-pca/101519#101519)). But you need to center the data manually before calling prcomp(). – amoeba Jun 13 '14 at 17:19
  • @amoeba Right. To get it right, I transposed the original data table and then used center=T in prcomp(), and the result is a lot different: now the first two PCs explain < 50% of the variance. – user2157668 Jun 13 '14 at 19:39
  • @user2157668: Good. Does it solve your problem? Otherwise, I suggest you edit your question. – amoeba Jun 13 '14 at 19:54
  • @amoeba I think it solved my problem, although I'm not sure how to correctly interpret the plots; I need to do more reading, I guess. Some previous questions helped a bit, like this one: http://stats.stackexchange.com/questions/62034/are-there-examples-of-more-informative-pca-plots – user2157668 Jun 13 '14 at 22:08
  • @amoeba thank you very much for the hints and help for the past few days! – user2157668 Jun 13 '14 at 22:09
  • @user2157668: No problem at all. However, it would be nice to mark this question as resolved, so maybe you could accept one of the answers (I have just provided one summarizing our discussion, but you can certainly accept any other). – amoeba Jun 14 '14 at 14:22
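
To make the eigenvector-vs-loading check from the comments above concrete, here is a minimal sketch in R, using the pca object from the question: prcomp()'s rotation columns are unit-norm eigenvectors, so each column's squared entries sum to 1; loadings (eigenvectors scaled by the component standard deviations) would instead have squared column sums equal to the eigenvalues.

colSums(pca$rotation^2)                            # each column sums to 1: unit eigenvectors
colSums(sweep(pca$rotation, 2, pca$sdev, "*")^2)   # loadings would give pca$sdev^2 instead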

3 Answers


Before your numbered questions can be addressed: somehow PC1 has a standard deviation listed as 3.4, yet all the plotted values lie between .287 and .290. There must be an error somewhere; this combination of results is not possible. Perhaps you have graphed only a few of the many points used in the PCA?

EDIT: I say "not possible" because a standard deviation (here, 3.4) can be no greater than the variable's range (here, .003). And at an intuitive level the SD represents, informally, something like a "typical" deviation from the mean, whereas a typical deviation in this graph would be about .0005.

You asked about eigenvalues: an eigenvalue summarizes an entire component (it is that component's variance), whereas what's plotted on your graph is one value per individual point.
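
One way to see the mismatch in R (a sketch, assuming the pca object from your question): the standard deviations reported by summary(pca) describe the scores in pca$x, not the entries of pca$rotation that you plotted.

apply(pca$x, 2, sd)        # matches pca$sdev: 3.4418, 0.20675, ...
apply(pca$rotation, 2, sd) # tiny values, on the scale of your graph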

rolando2
  • OK, I get what you meant, but why is the combination of results not possible? What are the values on the x and y axes? Are they eigenvalues? – user2157668 Jun 09 '14 at 20:57
  • http://stackoverflow.com/questions/19993980/prcomp-function-in-r-rotation-values-difference-by-sign — this example has similarly small rotation values but a standard deviation of ~2. – user2157668 Jun 09 '14 at 21:04
  • Since you referred to YC1CO, etc. as "points" I believed that they were observations/cases/individuals. Are they actually variables? In that case maybe what you've plotted is a set of variables' loadings on the components. – rolando2 Jun 09 '14 at 21:13
  • ...But even then it's hard to imagine how all variables could have PC1 loadings within such a tiny range. – rolando2 Jun 09 '14 at 21:24
  • Sorry, I didn't explain them. YC1CO etc. are individuals and each has many observations (millions), so I'm trying to see the relationships among individuals from those observations. The values on the axes are something I don't get either; they come from pca$rotation. – user2157668 Jun 09 '14 at 21:31
  • My advice is not to try to interpret output from someone else's possibly inapplicable procedure, but to exert your own control. Do you want to plot millions of observations? compare individuals? show relationships among variables? Then choose a procedure (or design your own) to accomplish your goal. – rolando2 Jun 10 '14 at 13:07
  • @user2157668: Can you please explain what you mean by "YC1CO etc are individuals and each has many observations (millions)"? Millions of observations -- are these data points or features? What is the original dimensionality of the dataset? How many data points are there in the dataset? If each individual contributes multiple data points, then how do you project individuals onto the PC1-PC2 plane? – amoeba Jun 11 '14 at 14:43
  • @amoeba YC1CO is an individual with about 5 million values (numbers), the other individuals have the same number of values, and my input is a big 5-million-row by 12-column table. I just ran PCA in R with prcomp(data,scale=T,center=T) on that input, and my understanding is that each individual is projected based on the sum of all the values within them. Does this sound OK to you? – user2157668 Jun 11 '14 at 20:08
  • @rolando2 I don't want to plot millions of observations; I do want to compare individuals and see their relationships. PCA is a common tool in the field, and it does reduce those millions of dimensions to several. – user2157668 Jun 11 '14 at 20:10
  • @user2157668: Yes. So in machine learning jargon, you have 5 million features and 12 data points; you perform PCA and reduce the dimensionality from 5 million to 2. The problem is that if PC1 accounts for 99% of the variance, the "spread" of numbers on the PC1 axis has to be quite a bit larger than on the PC2 axis. So either your figure or your variance computations are **wrong** (likely the figure). That is precisely what rolando said in his reply. – amoeba Jun 12 '14 at 09:41
  • @amoeba Yeah, the comments by ttnphns on the question and my calculation seem to suggest the coordinates are eigenvectors, not loadings. That doesn't mean the coordinate values of each individual are eigenvalues, right? How do I get the eigenvalues? – user2157668 Jun 12 '14 at 14:25
  • @user2157668: Ah! I think I have just had an eureka moment and realized what probably had happened: you did not center your data! So the sum of squares on PC1 is large, even though the variance is tiny. However, your last question shows that you are very much confused about the meaning of eigenvalues/eigenvectors/etc in PCA. I suggest you read something basic about PCA, maybe start with the highest voted questions with "pca" tag. – amoeba Jun 12 '14 at 14:36
  • @amoeba Yeah... I don't get PCA yet... but I did center the data, with this command in R: pca<-prcomp(data,scale=T,center=T) – user2157668 Jun 12 '14 at 14:42
  • @user2157668: Then I have no idea what is going on. Maybe you want to update your question and paste the full R code you used to process the data and generate the figure. Then somebody who knows R (I do not) might be able to find the mistake. – amoeba Jun 12 '14 at 14:58
  • @amoeba Yeah, I should have put all the info there; just edited it. Thanks! – user2157668 Jun 12 '14 at 15:33
  • @rolando2 I updated my question. The PCA I did was on scaled input data; when I set scale to FALSE, the sdev of PC1 became 1.52. – user2157668 Jun 12 '14 at 21:08

I fully agree with what @rolando2 wrote, but let me add a bit to the discussion addressing the questions you posed.

Based on the proportion of variance, I know PC1 explains almost all the variance. Do the x-axis and y-axis values matter at all here? The PC1 values are much closer together than the PC2 values, even though PC1 is the dominant PC.

It is not completely clear what you mean by "matter", but yes, they do play a role: once your data are projected onto the PCs, these are the new representations of the original data. But it is always tricky to make a two-dimensional comparison along two vectors with such different standard deviations as PC1 and PC2. Note that the PCs are normalized by their own standard deviation, so a plot in the PC space will not reveal much of the original variation, in case that's what you're looking for (one way to undo this normalization is sketched below).

Can I say that the differences between the points YC1CO, YC4CO, YC5CO and YM1CO, YM3CO, YM5CO are what drive PC1?

I think it is more correct to say that the differences between these points along the axis that defines PC1 are what led to PC1 being chosen as the most representative direction of the variance.
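
As a sketch of how one might undo that normalization (assuming the pca object from the question): scale each eigenvector by its component's standard deviation before plotting, so the spread along each axis reflects that component's share of the variance.

loadings <- sweep(pca$rotation, 2, pca$sdev, "*")  # eigenvector * st. dev.
plot(loadings[, 1], loadings[, 2], xlab = "PC1", ylab = "PC2")
text(loadings[, 1], loadings[, 2], labels = rownames(loadings), pos = 3)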

Hope this helps.

pedrofigueira
  • Thanks. I was wondering about the meaning of those axis values and whether bigger means higher variance... In my plot the x-axis values are much closer together than the y-axis values, even though the x axis is the first principal component. – user2157668 Jun 11 '14 at 20:14
  • Indeed, but to understand this one has to remove the normalization effect, knowing the variance along the PCs. – pedrofigueira Jun 12 '14 at 00:15
  • @pedrofigueira: The variance of the e.g. PC1 projection *according to the OP figure* is definitely << 1, so I don't think it was normalized. – amoeba Jun 12 '14 at 09:45

The confusion has already been clarified in the comments above, but I would like to provide an answer so that this thread can be closed.

R's function prcomp() takes as input a data matrix with variables in columns. In your example you have variables in rows, which means the center=T argument does not work as intended (it centers columns, not rows) and renders the PCA invalid. The solution is to transpose the data matrix.
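
A minimal sketch of the fix (assuming data is oriented as in the question, with the 12 individuals in columns):

# Transpose so the 12 individuals are rows; prcomp() then centers each of the
# ~5 million feature columns. scale.=TRUE would fail here, because all-zero
# features such as f1..f6 have zero variance and cannot be rescaled.
pca <- prcomp(t(data), center = TRUE, scale. = FALSE)

# pca$x holds the individuals' scores; plot them on the PC1-PC2 plane.
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2", type = "n")
text(pca$x[, 1], pca$x[, 2], labels = rownames(pca$x))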

amoeba