Can I assume that the higher `sum(pca.explained_variance_ratio_)` is, the better the separation of the groups?

I want to run PCA on 100 random samples and plot only the one with the best separation. Is choosing the one with the highest summed explained variance ratio the way to go?

For example:

The 1st PCA(3) has a summed variance ratio of 0.7

The 2nd PCA(3) has a summed variance ratio of 0.9

Can I assume the 2nd one will give me a better plot?

Thanks!

Pitouille
  • I wonder whether you are misinterpreting the "sum variance ratio". I am not familiar with your terminology, but the first PCA should always come with the highest variance, and the "sum" for the second one may be the cumulative sum of the first and second, in which case the second PCA only has a variance percentage of 0.9-0.7=0.2. – Christian Hennig Oct 11 '21 at 12:41
  • Does this answer your question? [PCA and proportion of variance explained](https://stats.stackexchange.com/questions/22569/pca-and-proportion-of-variance-explained) – Firebug Oct 11 '21 at 21:28

1 Answer

PCA does not optimise the separation between the groups, and the variances of the principal components are not normally informative about group separation.
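
To illustrate the point, here is a minimal sketch on synthetic data, assuming two groups and plain NumPy (PCA is computed via SVD rather than with scikit-learn). The data sets, the group sizes and the `separation` score (difference of group means in pooled standard deviation units) are made up for the illustration and are not taken from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per group (hypothetical size)
labels = np.repeat([0, 1], n)
shift = np.where(labels == 0, -1.0, 1.0)

def pca(X, n_components):
    """Return PCA scores and explained variance ratios, computed via SVD."""
    Xc = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    ratios = s**2 / np.sum(s**2)
    return Xc @ vt[:n_components].T, ratios[:n_components]

def separation(scores, labels):
    """Absolute difference of group means on one axis, in pooled-SD units."""
    a, b = scores[labels == 0], scores[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

# Data set A: large shared variance along x, groups differ only along y.
X_a = np.column_stack([rng.normal(0, 10, 2 * n),
                       shift + rng.normal(0, 0.1, 2 * n)])

# Data set B: 10 noisy features, groups differ along the first feature.
X_b = rng.normal(0, 1, (2 * n, 10))
X_b[:, 0] += 2 * shift

scores_a, ratio_a = pca(X_a, 1)
scores_b, ratio_b = pca(X_b, 1)
sep_a = separation(scores_a[:, 0], labels)
sep_b = separation(scores_b[:, 0], labels)

print(f"A: PC1 ratio {ratio_a[0]:.2f}, separation on PC1 {sep_a:.2f}")
print(f"B: PC1 ratio {ratio_b[0]:.2f}, separation on PC1 {sep_b:.2f}")
```

In this construction, data set A has the much higher explained variance ratio on PC1, yet its groups overlap almost completely along PC1, while data set B has the lower ratio and the clearly better separation.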

Christian Hennig
  • To expand a bit on the second point, you could have a PC1 that explains 10% of the variation yet completely explains the separation between the groups in the data. Conversely, you could have a PC1 that explains 90% of the variation in the data, yet the groups may not be linearly separable in principal component space at all. – alan ocallaghan Oct 11 '21 at 12:50
  • Thank you for your answers! Do you have any idea how to tell from the scores which separation is best? – Noob Programmer Oct 11 '21 at 13:50
  • PCA is not made for this; discriminant coordinates (discriminant functions) may help you, see https://en.wikipedia.org/wiki/Linear_discriminant_analysis#Discriminant_functions – Christian Hennig Oct 11 '21 at 19:36
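
As a rough sketch of the discriminant idea from the last comment, assuming only two groups and known group labels, the Fisher discriminant direction can be computed directly with NumPy. The synthetic data and the `separation` score below are hypothetical choices for illustration; for more than two groups, scikit-learn's `LinearDiscriminantAnalysis` would be the more usual tool.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per group (hypothetical size)
labels = np.repeat([0, 1], n)
shift = np.where(labels == 0, -1.0, 1.0)

# Synthetic data: large shared variance along x, groups differ only along y.
X = np.column_stack([rng.normal(0, 10, 2 * n),
                     shift + rng.normal(0, 0.1, 2 * n)])

def fisher_direction(X, labels):
    """Two-class Fisher discriminant direction: Sw^{-1} (m1 - m0)."""
    X0, X1 = X[labels == 0], X[labels == 1]
    within = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)
    return np.linalg.solve(within, X1.mean(axis=0) - X0.mean(axis=0))

def separation(scores, labels):
    """Absolute difference of group means on one axis, in pooled-SD units."""
    a, b = scores[labels == 0], scores[labels == 1]
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_sd

# Separation along the discriminant direction (uses the labels).
w = fisher_direction(X, labels)
lda_sep = separation(X @ w, labels)

# Separation along PC1 for comparison (ignores the labels).
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
pc1_sep = separation(Xc @ vt[0], labels)

print(f"separation on discriminant axis: {lda_sep:.2f}")
print(f"separation on PC1:               {pc1_sep:.2f}")
```

A label-based score of this kind, rather than the explained variance ratio, is one way to rank the 100 samples by how well the groups separate.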