I have been using PCA for exploratory data analysis for a long time, and I was sure that it is OK if the first principal components explain a high percentage (90% or even higher) of the data variance. Recently I found information that it is not good when the first few principal components explain such a high percentage of variance, and that this may be an analysis artefact caused by the dominance of several variables.

Could you please clarify what percentage of variance explained by the first principal components (e.g. the first 2-3) is appropriate in PCA, and what percentage may indicate the presence of dominant variables in the data?

Denis
  • There is no general answer, because it depends on the data and the purpose, as well as what you might mean by "dominant." If you have an application in mind, please include descriptions of those things in your post. – whuber Feb 25 '21 at 21:19
  • I've just read about this phenomenon in the book `Practical Statistics for Data Scientists` by `Peter Bruce and Andrew Bruce`, and I was wondering whether my initial understanding of the percentage of explained `variance` was wrong. – Denis Feb 25 '21 at 21:48
  • I can't say, because it depends on what you mean by "OK." If all you need to do is understand the overall magnitude of a collection of variables, that might be "OK;" but for other purposes you will need to analyze more principal components in order to achieve a useful result. PCA isn't really about "explaining variance;" that's just suggestive language used to describe the mathematics. PCA is performed primarily as a way to help simplify and understand useful patterns in complex multivariate data, often in cases where variance is of no interest at all. – whuber Feb 25 '21 at 22:55
  • Thanks for your reply! By "OK" I mean producing correct results (e.g. correctly identifying the hidden patterns in the data). There is an undesirable situation in which variables with a high variance get extremely large `loadings`, which in turn can bias the `PCA` analysis. The book mentioned above suggests constructing a `Screeplot` and carefully examining the percentage of explained `variance` as a potential way to avoid such situations. I hope it is now clear what I am trying to clarify. – Denis Feb 26 '21 at 10:24
  • This is complicated. See https://stats.stackexchange.com/questions/50537 and https://stats.stackexchange.com/questions/53 for instance. Your implied concept of "dominant" is too vague and qualitative to permit any general answer. – whuber Feb 26 '21 at 13:56
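
The concern raised in the comments (a high-variance variable getting a very large loading and inflating the share of variance attributed to the first component) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the data are synthetic, the variables are independent by construction, and the helper names `explained_variance_ratio` and `first_loadings` are hypothetical. It is not the procedure from the book mentioned above, only one way to see the effect of scale.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance carried by each principal component."""
    Xc = X - X.mean(axis=0)  # PCA works on centered data
    # Squared singular values of the centered data are proportional
    # to the variances along the principal components.
    s = np.linalg.svd(Xc, full_matrices=False, compute_uv=False)
    var = s**2 / (X.shape[0] - 1)
    return var / var.sum()


def first_loadings(X):
    """Unit-length direction (loadings) of the first principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]


# Synthetic example (assumption): four independent variables,
# the first one measured on a scale 100 times larger than the others.
rng = np.random.default_rng(0)
scales = np.array([100.0, 1.0, 1.0, 1.0])
X = rng.normal(size=(500, 4)) * scales

# PCA on the raw data: the large-scale variable takes over PC1.
raw_ratio = explained_variance_ratio(X)
raw_pc1 = first_loadings(X)

# PCA on standardized data (each column rescaled to unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
std_ratio = explained_variance_ratio(Z)

print("raw explained variance ratio:         ", np.round(raw_ratio, 4))
print("raw PC1 loadings:                     ", np.round(raw_pc1, 4))
print("standardized explained variance ratio:", np.round(std_ratio, 4))
```

In this sketch the first component of the raw data carries almost all of the variance only because of the measurement scale of one variable; after standardization the variance is spread roughly evenly, as expected for independent variables. A high percentage on real data is not necessarily an artefact of this kind, which is consistent with the comments that no general threshold exists.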
