
I have a small dataset of 60 observations with 6 features. I want to perform PCA on this data as a class exercise. To that end I've removed outliers flagged by Grubbs' test, then imputed the missing data using the missMDA package. As a last step before PCA I was going to check each feature for normality using Shapiro–Wilk, since the sample size is so small (reduced to 58 after outlier removal).

When I check each feature with Shapiro–Wilk I get a p-value: above 0.05 I accept the feature as normal, below that as non-normal. I assumed PCA requires normality, so I thought I needed to transform the data with a log function, or perhaps the R scale function.

But when I try either and run Shapiro–Wilk again, the p-value I get is still < 0.05. Could an expert here give me any pointers, please?

Version 1 : Base Test

shapiro.test( MyData[,2]  )

    Shapiro-Wilk normality test

data:  MyData[, 2]
W = 0.76413, p-value = 2.822e-08

Version 2 : Test scaled data

shapiro.test( scale( MyData[,2] ) )

    Shapiro-Wilk normality test

data:  scale(MyData[, 2])
W = 0.76413, p-value = 2.822e-08

Version 3 : Test log transformed data

shapiro.test( log( MyData[,2] ) )

    Shapiro-Wilk normality test

data:  log(MyData[, 2])
W = 0.93537, p-value = 0.004087
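I notice the W statistic and p-value in Versions 1 and 2 are identical, which I take to mean that scale(), being a linear transformation, cannot change the Shapiro–Wilk result at all. A quick check with simulated skewed data (standing in for MyData[,2], which I can't share):

```r
# Shapiro-Wilk is unchanged by linear transformations such as scale():
set.seed(1)
x <- rexp(58)  # skewed fake data, same size as my sample
shapiro.test(x)$statistic          # W for the raw data
shapiro.test(scale(x))$statistic   # same W for the centred and scaled data
```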

Is it necessary to perform a normalisation step on the data before PCA, given, for example, that the prcomp function in R has `center` and `scale.` parameters? Is it enough just to set each of those to TRUE and calculate the PCA on data that would have failed Shapiro–Wilk? Any tips would be greatly appreciated, thanks for your time.
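Since I can't share the real data, here is the sort of call I mean, on fake skewed data of the same shape (58 rows, 6 features):

```r
# Fake 58 x 6 data standing in for my real dataset
set.seed(1)
fake <- as.data.frame(matrix(rexp(58 * 6), ncol = 6))

# Let prcomp do the centring and scaling itself
# (note the argument name is scale. with a trailing dot):
pca <- prcomp(fake, center = TRUE, scale. = TRUE)
summary(pca)  # proportion of variance explained per component
```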

Rob
  • It's Grubbs' test, not Grubb's test, but more interestingly (1) why remove outliers? (2) why do you think normality is needed for PCA? I guess that you're confused by different senses of normalise in the literature. Standardisation of data in the sense of (value − mean) / SD is a good idea for PCA unless features are in the same units and comparable as such. But that is nothing to do with whether data are normally distributed. Unfortunately, that scaling is also often known as normalisation. I can't advise on the specifics of R implementations, which in any case are off-topic here. – Nick Cox Apr 16 '18 at 13:39
  • I'd list the data here. You may get good advice that way. – Nick Cox Apr 16 '18 at 13:40
  • Hi Nick, thanks very much for the feedback. For 1) I thought it best to remove any outlier that might throw off the imputation of the missing values in the dataset. Applying Grubbs' test showed just two, so I thought it was OK to remove them before the imputation. For 2) I've seen it mentioned online in a few places, but I don't have a formal definition I suppose. Definitely there's confusion on my part. Regarding the data, given this is towards a graded assignment I'd probably best keep this to generalities in order to avoid any possible complaint I might get on my side :-/ – Rob Apr 16 '18 at 13:58
  • If you're instructed to remove outliers, I don't want to lower your grade, but I'll signal a view that that is often a bad idea. Grubbs' test in particular is based on the idea that the data should be normal, but if that's wrong then it's just going to lead to bad analyses. – Nick Cox Apr 16 '18 at 14:00
  • The data itself I can't say should be normally distributed: each observation is a material, each feature is a component of that material. The lecturer has said that if the data is severely non-normal then it can have an impact on the results of the PCA calculation. So I guess I'll just use QQ plots/histograms for the visual side, and Shapiro–Wilk for the significance test, as the number of observations is low. – Rob Apr 17 '18 at 06:56
  • PCA is just a transformation. I would be much more worried about nonlinearity than about non-normality as such. The problem with deleting outliers as awkward is that they are often genuine and informative. – Nick Cox Apr 17 '18 at 08:32
  • Hi Nick, I don't see an option to accept a comment as the answer, so if you have time please do so for the first comment and I'll do so. Thanks again for the insight! – Rob Apr 17 '18 at 09:11
  • Oddly I just made the same suggestion in reverse to a commenter who declined to do that. Thanks, but I will leave the comments as they are. I don't think there is a good answer to your question without data and results. – Nick Cox Apr 17 '18 at 09:16
  • Fair enough, and thanks again - I'm sure this won't be my last query on the site, so I'll try to pay it forward. – Rob Apr 17 '18 at 09:25
  • As before, I suggest listing the original data. – Nick Cox Apr 17 '18 at 10:12
  • Does this answer your question? [What is the intuition behind SVD?](https://stats.stackexchange.com/questions/177102/what-is-the-intuition-behind-svd) – kjetil b halvorsen May 16 '20 at 02:42

0 Answers