How normal should variables be to run PCA?

Question

I am new to statistics and this is the first time I am trying to normalise variables, so apologies for my inexperience.

My goal is to classify the landscape by cluster analysis (using kmeans in R) and see if there is a relationship between species distribution and type of the landscape.

However, some of the variables I am using (there are 10 in total) are strongly correlated. Thus, I decided to do PCA and run the cluster analysis on the principal components. I tried this without normalising the variables before PCA (although I did standardise them with data.frame(scale(data,center=T,scale=T))), but the results didn't impress me: I observed a stronger and more interpretable relationship when I ran the cluster analysis on 4 uncorrelated and hypothetically most important variables (also standardised). So now I want to normalise the variables and rerun the PCA and cluster analysis. To make matters worse, some of the variables are far from normal, the sample size is large (n=40038), and I have no experience in transforming data.
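For reference, the workflow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the data are simulated (the object `landscape` and its 10 correlated columns are hypothetical stand-ins for the real variables), and both the 90% variance threshold and k = 4 are arbitrary choices, not recommendations.

```r
# Simulated stand-in for the real data: 10 correlated variables
# built from 3 latent gradients (all names and sizes are hypothetical).
set.seed(42)
n <- 500
latent   <- matrix(rnorm(n * 3), ncol = 3)
loadings <- matrix(runif(3 * 10, -1, 1), nrow = 3)
noise    <- matrix(rnorm(n * 10, sd = 0.3), ncol = 10)
landscape <- as.data.frame(latent %*% loadings + noise)

# PCA on standardised variables; prcomp() centres and scales internally,
# which is equivalent to calling scale() first.
pca <- prcomp(landscape, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep enough components to reach 90% of the variance (arbitrary threshold)
n_comp <- which(cumsum(var_explained) >= 0.9)[1]
scores <- pca$x[, seq_len(n_comp), drop = FALSE]

# k-means on the retained component scores (k = 4 chosen only for illustration)
clusters <- kmeans(scores, centers = 4, nstart = 25)
table(clusters$cluster)
```

Note that nothing in this pipeline checks or requires normality; the transformation question only affects how each variable's spread and outliers feed into the components.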

I have read that normality tests are useless for large samples and, more generally, for deciding whether parametric methods can be applied to one's data. So I am inspecting normality visually and through the values of kurtosis and skewness. For example, I have one very problematic variable with many zero values, which appears to follow a gamma distribution.

I transformed the variable to the power of 0.3 (x^0.3) and got the following results:

Skewness= 0.5006657 Kurtosis= 3.255236
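A check of this kind can be reproduced in base R along the following lines. This is a sketch on simulated data, not the real variable: the zero-inflated gamma sample `x` is hypothetical, and the helper functions `skewness()` and `kurtosis()` are defined here using plain moment formulas in which a normal distribution has kurtosis 3 (assumed to match the convention behind the value reported above).

```r
# Hypothetical zero-inflated, right-skewed variable:
# about 30% exact zeros, the rest drawn from a gamma distribution.
set.seed(1)
x <- ifelse(runif(5000) < 0.3, 0, rgamma(5000, shape = 2, rate = 1))

# Moment-based skewness and (non-excess) kurtosis; normal data give ~0 and ~3
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / mean((v - m)^2)^1.5
}
kurtosis <- function(v) {
  m <- mean(v)
  mean((v - m)^4) / mean((v - m)^2)^2
}

# Power transformation as in the question
x_pow <- x^0.3

c(raw = skewness(x), power = skewness(x_pow))
c(raw = kurtosis(x), power = kurtosis(x_pow))

# The share of exact zeros is unchanged: 0^0.3 is still 0
c(raw = mean(x == 0), power = mean(x_pow == 0))
```

The last line shows why a power transformation alone cannot make such a variable look normal: every zero maps to zero, so the spike at zero survives whatever exponent is chosen.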

I also tried other transformations and the yeo.johnson() function, but none of them gave a better result. I can see that the result is still far from normal. Nonetheless, maybe it is a fair enough approximation of a normal distribution for PCA, since this method doesn't strictly require normality? To rephrase the question: what will happen if I run PCA with this kind of variable while the other variables are more or less normally distributed, and then use k-means to classify the principal component scores? Finally, is there perhaps a better way to transform this variable?

Have you made a search here on `pca normality` and `k-means normality`? — ttnphns, Jun 02 '17 at 11:32
Have you tried centering AND scaling or PCA from correlation matrix? Maybe try `prcomp(x, center = TRUE, scale. = TRUE)` ? — gunakkoc, Jun 02 '17 at 11:41
I made the search and used scale(x,center=T,scale=T) function. — Liudas Daumantas, Jun 02 '17 at 11:53
