PCA: 91% of explained variance on one principal component

Question

I am new to PCA and wanted to do a bit of experimentation on my data set just to see what it looked like (using R). I am not able to give access to the data here since it is confidential. However, if there is some other kind of statistic/visualization you would like to see that would help you answer my questions please let me know and I will provide it.

I found the following information about the explained variance:

Component Prop.Var
1         0.911804348
2         0.033618098
3         0.020827269
4         0.011772988
5         0.006611746
6         0.005372772
7         0.004464788
8         0.003436401
9         0.002091589

This raises the following questions:

Am I justified in removing the other 8 principal components?
How do I interpret 91% of explained variance on one component?
If I only kept one component what would be the best way to visualize the data?

Below is the plot of the first two principal components. A spread like this is not surprising given how little of the variance lies on the second component.

[Figure: Principal Components 1 and 2]

As I mentioned, I am new to PCA so I really do not know if there is even any useful information to be found from this kind of dimensional reduction. Any insight would be appreciated.

I'd say, yes, you can discard the other components. You retain 91% of the information, with 10% of the complexity. Your variables are highly correlated, but also appear to be skewed. What's the aim of your analysis? — Jeremy Miles, Apr 30 '14 at 23:51
Look at the correlations between your PCs and the original variables and the correlation matrix of the variables, and think about whether the results make sense. If you are using a covariance matrix, look at that and check that it makes sense, notably that all the variables are measured on the same scale. — Nick Cox, Apr 30 '14 at 23:55
@Jeremy Miles If I keep one dimension, how would I visualize it? I can think of a couple of ways, but I am not sure what any of these graphs would really tell me. The aim is cluster analysis - I would like to see data points that change in the same way. I realize that PCA may not be the best tool for this, but I wanted to see if there was anything interesting I could conclude about the data before going to K-means or something else. — syntonicC, Apr 30 '14 at 23:57
Is this biological data? Or are there other strong confounding factors like race would be in human genetics? If so you may actually want to correct for the largest component/eigenvectors as in the EIGENSTRAT method: http://www.nature.com/ng/journal/v38/n8/abs/ng1847.html — Ryan Bressler, May 01 '14 at 00:49
Definitely take a look at the scale of the data - it may be that one variable is simply huge (e.g., weight measured in grams vs. height measured in kilometers). — chmullig, May 01 '14 at 03:03
This sort of result is common if you have a scaling problem. I second what @chmullig said. — Aaron, May 01 '14 at 03:23
I would definitely cross-check with a [robust PCA approach](http://stats.stackexchange.com/a/33602/603) — user603, May 31 '14 at 18:03
Expanding on @Aaron's point, it is generally advisable to subtract the mean and divide by the standard deviation for each feature in your dataset. — Jessica Collins, Jun 30 '14 at 22:33
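
A minimal sketch of the scaling point raised in the comments above. The data here are simulated (an assumption, since the original data are confidential), and the names `toy` and `prop_var` are made up for illustration. It compares PCA on the covariance matrix with PCA on the correlation matrix when one variable is recorded in much larger units.

```r
# Simulated data (an assumption, not the asker's data): three independent
# variables, one of which is recorded on a much larger numeric scale.
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n, sd = 1),
  x2 = rnorm(n, sd = 1),
  x3 = rnorm(n, sd = 1000)  # same kind of noise, much larger units
)

# Hypothetical helper: proportion of variance explained by each component
prop_var <- function(fit) fit$sdev^2 / sum(fit$sdev^2)

# PCA on the covariance matrix: variables centred but not scaled
pve_cov <- prop_var(prcomp(toy, center = TRUE, scale. = FALSE))

# PCA on the correlation matrix: variables centred and scaled to unit variance
pve_cor <- prop_var(prcomp(toy, center = TRUE, scale. = TRUE))

print(round(pve_cov, 4))
print(round(pve_cor, 4))
```

Without scaling, the first component is essentially just `x3`, so a single dominant component can be an artifact of units rather than of correlation. If the dominance survives scaling, it is more likely to reflect genuine correlation among the variables.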

score 4 · Answer 1 · answered May 01 '14 at 01:13

I am (very) new to this, but I'll do my best to help. My answers to your questions are below.

Am I justified in removing the other 8 principal components?

I do not think you are "justified". However, if you want a first, coarse assessment of the data, you can concentrate on the first PC; just bear in mind that you are neglecting 9% of the total variability. This raises further questions: were the variables expected to be so strongly correlated? Could you simulate or explain the remaining 9% of variability simply by invoking measurement error?

How do I interpret 91% of explained variance on one component?

You can interpret it as a very high degree of correlation among the many variables you included, or between at least two variables while the others show much smaller dispersion. When you look at the first PC in terms of the original measurements (its loadings), how many variables contribute significantly?
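
A rough sketch of how one might inspect the first PC in terms of the original variables. The data are simulated (an assumption): four variables driven by one shared factor, which is one way to end up with most of the variance on a single component. The names `toy`, `shared`, `pc1_loadings` and `pc1_cor` are illustrative only.

```r
# Simulated data (an assumption): four variables driven by one shared factor
set.seed(2)
n <- 300
shared <- rnorm(n)
toy <- data.frame(
  v1 = shared + rnorm(n, sd = 0.2),
  v2 = shared + rnorm(n, sd = 0.2),
  v3 = shared + rnorm(n, sd = 0.2),
  v4 = shared + rnorm(n, sd = 0.2)
)

fit <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings of the first component: one weight per original variable
pc1_loadings <- fit$rotation[, 1]

# Correlation between the PC1 scores and each original variable
pc1_cor <- cor(toy, fit$x[, 1])[, 1]

print(round(pc1_loadings, 3))
print(round(pc1_cor, 3))
print(round(cor(toy), 2))  # correlation matrix of the original variables
```

If all loadings are of similar size and the same sign, PC1 is roughly an average of the variables. If one or two loadings dominate, the component is mostly those variables, which is worth checking against the scale of the data.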

If I only kept one component what would be the best way to visualize the data?

If you only kept one component, your final description of the data would be 1D, so a single axis would do the job. I repeat myself, and please do not take my words as patronizing, but I would try to understand whether the PC you calculated makes sense given the data.
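
One possible way to draw such a single axis, sketched on simulated data (an assumption; `toy`, `shared` and `scores` are illustrative names): a histogram of the PC1 scores with a rug, plus a jittered strip chart. Neither is "the" right plot, they are just two common 1D displays in base R.

```r
# Simulated data (an assumption): three variables sharing a skewed factor
set.seed(3)
n <- 300
shared <- rexp(n)
toy <- data.frame(
  v1 = shared + rnorm(n, sd = 0.2),
  v2 = shared + rnorm(n, sd = 0.2),
  v3 = shared + rnorm(n, sd = 0.2)
)

fit <- prcomp(toy, center = TRUE, scale. = TRUE)
scores <- fit$x[, 1]  # coordinates of each observation on PC1

op <- par(mfrow = c(2, 1))  # two panels stacked vertically
hist(scores, breaks = 30, main = "Distribution of PC1 scores",
     xlab = "PC1 score")
rug(scores)  # tick marks for individual observations
stripchart(scores, method = "jitter", pch = 16,
           main = "PC1 scores along a single axis", xlab = "PC1 score")
par(op)  # restore the previous graphics settings
```

Gaps or several modes along this axis would hint at groups worth following up with a clustering method, while a single smooth hump would suggest PC1 alone does not separate clusters.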

(+1) Ignoring smaller PCs can't be justified in general, for all purposes, only with respect to some particular subsequent analysis. There's no law of nature that bigger PCs of a set of variables correlate more strongly with other variables from outside the set. See @NickStauner's answer [here](http://stats.stackexchange.com/questions/87198/pca-randomness-of-component/87231#87231). — Scortchi - Reinstate Monica, Aug 01 '14 at 11:32

score 1 · Answer 2 · answered May 01 '14 at 01:13

Going by your plot of the first two components, I would definitely say keep the second one, and maybe the third one too.

Plotting the cumulative proportion of variance explained also helps a lot when deciding how many components to keep.

If you are using R, there are simple methods to do that. You could look up R labs in standard data mining books like the ones by Tibshirani.

Kaizzen

`plot(cumsum(pve), xlab="Principal Component", ylab="Cumulative Proportion of Variance Explained", ylim=c(0,1))` where `pve` is the proportion of variance explained. – Kaizzen May 01 '14 at 01:16
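
For a concrete version of that plot, here is a small sketch using the proportions reported in the question. With a `prcomp` fit, `pve` could instead be computed as `fit$sdev^2 / sum(fit$sdev^2)`. The 0.95 cutoff and the names `cum_pve` and `n_keep` are assumptions for illustration, not a recommended rule.

```r
# Proportions of variance reported in the question, components 1 to 9
pve <- c(0.911804348, 0.033618098, 0.020827269, 0.011772988, 0.006611746,
         0.005372772, 0.004464788, 0.003436401, 0.002091589)

cum_pve <- cumsum(pve)

plot(cum_pve, type = "b", xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained", ylim = c(0, 1))
abline(h = 0.95, lty = 2)  # 0.95 is an arbitrary illustrative threshold

# Smallest number of components whose cumulative proportion reaches 0.95
n_keep <- which(cum_pve >= 0.95)[1]
print(round(cum_pve, 4))
print(n_keep)
```

How many components to keep still depends on what the reduced data will be used for; the threshold only summarizes variance, not usefulness for clustering.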