I am trying to understand why a normal distribution is said to be an assumption for PCA and what might happen when it is violated. I found one answer on this platform and a lot of different answers in the literature. They range from "there is no need for a normal distribution as long as the variables are linearly correlated" to "a normal distribution is essential; if it is violated, PCA can go really wrong". I would be really happy if somebody could explain the connection between the normal distribution and PCA.

Concetta
  • If you have already found answers to this question (at least 3 it would seem) then please link to them, especially if they are on this forum. Otherwise your question is just blowing in the wind. – Gordon Smyth Jan 15 '19 at 07:58
  • See https://stats.stackexchange.com/questions/32105/pca-of-non-gaussian-data or https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues Both answer your question. – Gordon Smyth Jan 15 '19 at 08:41
  • The short answer is that normality is definitely not an assumption of PCA, but there are some extra interpretations and results that can be derived when the data do happen to be multivariate normal. – Gordon Smyth Jan 15 '19 at 22:16
  • @GordonSmyth Thank you very much for your comments. I read, for example, in the first version of Jon Shlens' paper on PCA [link](http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf) that PCA assumes a distribution that can be described by mean and variance alone. That is only true for the Gaussian distribution. As PCA only takes the variance of the given data into account, this sounds pretty convincing to me. – Concetta Jan 17 '19 at 10:32
  • @GordonSmyth On the other hand, looking at how PCA is done, I can't see a reason why it shouldn't work properly when my distribution is a bit skewed or heavy-tailed compared to a Gaussian distribution. What seems pretty obvious to me is that the correlation between the observed variables has to be linear. What can happen when the correlation is not linear is nicely shown in the Ferris wheel example in the third version of Shlens' PCA tutorial [link](https://arxiv.org/pdf/1404.1100.pdf). – Concetta Jan 17 '19 at 10:33
  • @GordonSmyth An article I found rather interesting is [link](https://link.springer.com/article/10.3758/s13428-012-0193-1). It states that principal components are (naturally) uncorrelated but only independent when the observed data are normally distributed. That is an effect I noticed in the PCA I've done with my (slightly skewed and heavy-tailed) data. But I can't grasp the mathematical reasons behind it or, more generally, the mathematical impact of the Gaussian distribution on PCA. So if there is a long answer to your short one, I would be most grateful. – Concetta Jan 17 '19 at 10:33
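
The "uncorrelated but not independent" effect mentioned in the last comment can be reproduced with a small simulation. For jointly Gaussian data, a diagonal covariance makes the joint density factorize, so uncorrelated components are also independent; for other distributions that step does not follow. The sketch below is only an illustration under stated assumptions: the helper names `pca_scores` and `linear_corr`, the two synthetic datasets, and the use of the correlation between the squared first score and the second score as a crude dependence check are all choices made here, not something taken from the thread or the linked papers.

```python
import numpy as np

def pca_scores(X):
    """Project centered data onto the eigenvectors of its sample covariance.

    Hypothetical helper for this sketch: plain eigendecomposition PCA,
    components ordered by decreasing variance.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending variance
    return Xc @ eigvecs[:, order]

def linear_corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
n = 20_000

# Assumed dataset 1: correlated bivariate Gaussian (linear mix of normals).
A = np.array([[2.0, 0.0], [1.0, 0.5]])
gauss = rng.standard_normal((n, 2)) @ A.T

# Assumed dataset 2: non-Gaussian, second variable is a noisy function
# of the first (a parabola), so the relationship is not linear.
s = rng.uniform(-1.0, 1.0, size=n)
curved = np.column_stack([s, s**2 + 0.01 * rng.standard_normal(n)])

results = {}
for name, X in {"gaussian": gauss, "curved": curved}.items():
    pc = pca_scores(X)
    results[name] = {
        # Linear correlation of the two scores: zero by construction of PCA.
        "corr_pc1_pc2": linear_corr(pc[:, 0], pc[:, 1]),
        # Crude dependence check: does pc1**2 predict pc2?
        "corr_pc1sq_pc2": linear_corr(pc[:, 0] ** 2, pc[:, 1]),
    }
    print(name, results[name])
```

In both cases the scores are linearly uncorrelated, because PCA diagonalizes the sample covariance regardless of the distribution. Only in the Gaussian case does the second check also come out close to zero; for the parabola data the second score is almost a deterministic function of the first, which PCA cannot see because it only uses second moments.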

0 Answers