
I'm running PCA on my dataset using R and need some help interpreting the standard deviation results.

Here are the results:

> summary(wine1.pca)
Importance of components:
                          PC1    PC2    PC3     PC4     PC5     PC6     PC7    PC8     PC9    PC10    PC11    PC12
Standard deviation     1.7440 1.6278 1.2812 1.03373 0.91682 0.81266 0.75088 0.7183 0.67710 0.54683 0.47704 0.18111
Proportion of Variance 0.2535 0.2208 0.1368 0.08905 0.07005 0.05503 0.04698 0.0430 0.03821 0.02492 0.01896 0.00273
Cumulative Proportion  0.2535 0.4743 0.6111 0.70011 0.77016 0.82520 0.87218 0.9152 0.95338 0.97830 0.99727 1.00000

From what I've read, it is good to pick the number of components that together explain 85% or more of the variation.

Questions

  • Should the class variable be part of the dataframe when performing PCA?

  • How can I find out from these results how many components would give 85% or more of the variance? Would it be PC5, because the standard deviation is 0.91 and then drops to 0.81 for PC6?

[Figure: scree plot of explained variance by principal component]

birdy
  • Use the "Cumulative Proportion" field as a guide for where to cut. If you want to retain *at least* 85% of the variation, you should pick the top 7 principal components for this data. – Vladislavs Dovgalecs Apr 03 '15 at 15:52
  • @xeon I've added a scree plot to the question as well. How does what you are saying relate to the scree plot? The explained variance (y-axis) is just below 0.5 for the top 7 principal components. – birdy Apr 03 '15 at 15:57
  • It is about the *cumulative* proportion, from the first to the n-th component. You should integrate this plot and then read off the value. If you use this plot directly, a point at the "knee" of the plot is typically a pretty good choice. – Vladislavs Dovgalecs Apr 03 '15 at 16:03
  • OK, great. Thanks. One last naive question: should the class variable be part of the data when doing PCA, or is it best to leave it out? – birdy Apr 03 '15 at 16:06
  • Do you mean include class variable as a feature in the dataset? – Vladislavs Dovgalecs Apr 03 '15 at 16:07
  • Yeah, when I performed PCA the class variable was part of my dataframe in R. Should it be left out? – birdy Apr 03 '15 at 16:09
  • Unless you want to analyze the variance for this variable, you should not include the label. Additionally, this variable is categorical. You don't want to include this variable as you are analyzing the data, not data+class label. – Vladislavs Dovgalecs Apr 03 '15 at 16:12
  • @xeon: Your comments above fully answer this question (both subquestions). Consider posting an answer, so that this thread could be settled. – amoeba Apr 03 '15 at 19:49
  • @amoeba I suspect the OP might be better off by being challenged about "what I read." From this plot it would appear that using enough PCs to account for 85% of the variation would be overfitting the data considerably, perhaps paying too much attention to what eventually turns out to be noise. – whuber Apr 06 '15 at 19:33
  • possible duplicate of [How to interpret this PCA biplot to determine which attributes to pick?](http://stats.stackexchange.com/questions/144702/how-to-interpret-this-pca-biplot-to-determine-which-attributes-to-pick) – Has QUIT--Anony-Mousse Apr 09 '15 at 14:43

1 Answer


From your output, use the "Cumulative Proportion" row as a guide to how many principal components to keep. Choose the percentage of variance you want to retain, then pick the first column (its index is also the number of components) whose cumulative proportion reaches that percentage. To keep at least 85% of the variance in your example, you need 7 principal components: their cumulative proportion is 0.87218, whereas 6 components reach only 0.82520.
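As a rough sketch of that rule, the cumulative proportion can be recomputed from the standard deviations and searched for the first value at or above the threshold. The vector `sdev` below is typed in by hand from the `summary()` output in the question; with the fitted object available, `wine1.pca$sdev` should hold the same values, assuming `wine1.pca` was produced by `prcomp()`. The names `prop_var`, `cum_prop`, and `n_keep` are illustrative only.

```r
# Standard deviations copied from the summary() output in the question
# (assumption: wine1.pca$sdev holds the same values if wine1.pca comes from prcomp())
sdev <- c(1.7440, 1.6278, 1.2812, 1.03373, 0.91682, 0.81266,
          0.75088, 0.7183, 0.67710, 0.54683, 0.47704, 0.18111)

# The variance of each component is its squared standard deviation
prop_var <- sdev^2 / sum(sdev^2)

# Running total of explained variance, i.e. the "Cumulative Proportion" row
cum_prop <- cumsum(prop_var)

# Smallest number of components whose cumulative proportion reaches 85%
n_keep <- which(cum_prop >= 0.85)[1]
print(n_keep)
```

Note that the cut is made on the cumulative proportion of variance, not on where the standard deviation drops below some value.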

Concerning the added plot, it is trickier to read. To proceed as described in the previous paragraph for a given percentage, you would first integrate the plot (that is, sum the explained variance per component from left to right) and then read off the number of components needed. In fact, you already have this information: it is the very same "Cumulative Proportion" row. Just plot it and you will see.
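One possible way to draw that cumulative curve in base R is sketched below. The standard deviations are again copied by hand from the question's output, and the dashed line at 0.85 is only the threshold discussed here, not a general recommendation.

```r
# Standard deviations copied from the summary() output in the question
sdev <- c(1.7440, 1.6278, 1.2812, 1.03373, 0.91682, 0.81266,
          0.75088, 0.7183, 0.67710, 0.54683, 0.47704, 0.18111)

# "Integrating" the scree plot amounts to a cumulative sum of the
# per-component proportions of variance
cum_prop <- cumsum(sdev^2 / sum(sdev^2))

# Cumulative proportion of variance against the number of components kept
plot(seq_along(cum_prop), cum_prop, type = "b", ylim = c(0, 1),
     xlab = "Number of principal components",
     ylab = "Cumulative proportion of variance")

# Reference line at the 85% threshold from the question
abline(h = 0.85, lty = 2)
```

The first point on or above the dashed line gives the number of components to keep for that threshold.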

Finally, about whether to include the class variable in the dataset analyzed with PCA: your intent is to analyze the measurements, not the class label. The class label is additional information (typically posterior), and you don't want it analyzed together with the measurements. If the dataset also included the class variable, the directions of maximum variance would be hard to interpret.
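A minimal sketch of leaving the label out is shown below. The wine data frame from the question is not available here, so the built-in `iris` data set stands in for it, with `Species` playing the role of the class variable. The choice to center and scale is also an assumption rather than something stated in the question.

```r
# iris is a built-in data set used as a stand-in for the question's data;
# Species is its class label (the wine data frame is not shown in the question)
features <- iris[, names(iris) != "Species"]

# PCA on the measurements only; centering and scaling are an assumption here
pca <- prcomp(features, center = TRUE, scale. = TRUE)
print(summary(pca))

# The label can still be used afterwards, for example to color the scores
plot(pca$x[, 1:2], col = as.integer(iris$Species))
```

Keeping the label aside this way means the loadings in `pca$rotation` describe only the measured variables, while the class remains available for interpreting the projected points.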

Vladislavs Dovgalecs