It's the conventional wisdom that a PCA transformation can cure multicollinearity. Putting this into practice on example data, I find myself confused. In the following case, applying PCA seems to have made the multicollinearity (as measured by VIF) much worse!

library(regclass)
library(car)
library(tidymodels)

data("WINE")

## Drop the rating variable
data <- WINE[, -1]

vif(lm(alcohol ~ ., data = data))

If we go by the conservative benchmark that a VIF of 2 or greater indicates multicollinearity that needs to be removed, then we need to do something about it here.

[Screenshot: VIF output for the original predictors, with three values above 2]
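(For reference, $\mathrm{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on all the other predictors, so a VIF of 2 corresponds to half of a predictor's variance being explained by the rest.)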

Let's apply PCA.

## 3 variables with VIF over 2; would like to remove that multicollinearity. Let's try PCA.

withpca <- recipe(alcohol ~ ., data = data) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  juice()

vif(lm(alcohol ~ ., data = withpca))

#multicollinearity is now much worse!

What has gone wrong? The VIFs for PC1 and PC4 are now astronomical! Can someone explain whether I've misunderstood the conventional wisdom about PCA and multicollinearity or whether that conventional wisdom is just wrong?

[Screenshot: VIF output after PCA, showing astronomical values for PC1 and PC4]

  • Your example is too complicated to be useful for studying the effect of PCA in regression. As a result you make programming errors: the new VIFs indeed will all equal $1$ (subject to the effects of floating point imprecision, which can be large in the last few PCs). – whuber Feb 01 '22 at 15:19
  • I don't think complexity is the problem, actually. The data set is real and simple, the code is relatively simple... But some applications of PCA scale automatically, some don't. I see now that this application does not automatically scale the data. – curiositasisasinbutstillcuriou Feb 01 '22 at 16:09
  • The code does a *lot* of things that are irrelevant to the question. This makes it more difficult to diagnose and explain all the things that might be unexpected. – whuber Feb 01 '22 at 16:45
  • It drops a variable, it finds the VIFs of the lm, and then it applies part of the tidymodels package for pre-processing (PCA, scaling). None of that is actually irrelevant. I get that an even simpler approach could help for teaching purposes... But that was not the genesis of this post; this was the code that actually led to the error. – curiositasisasinbutstillcuriou Feb 01 '22 at 17:31
  • That's not a good way to solve a problem, though, nor is it a good way to ask for help with a problem. Please see https://stackoverflow.com/help/minimal-reproducible-example for some guidance. – whuber Feb 01 '22 at 17:39
  • No this was a fine way to solve a problem and it was a fine way to ask for help. What's above is a perfectly adequate minimal reproducible example that anyone with R can generate with 3 packages and 5 lines of basically straightforward code. – curiositasisasinbutstillcuriou Feb 01 '22 at 17:45
  • "Reproducible," perhaps. "Minimal"--absolutely not. It is wise to learn how to minimize one's examples. Among other things, it often leads to *you* discovering the problem. It also generates good will among those you would ask for help. – whuber Feb 01 '22 at 17:46
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/133713/discussion-between-curiositasisasinbutstillcuriou-and-whuber). – curiositasisasinbutstillcuriou Feb 01 '22 at 17:54

2 Answers

3

It seems that I've made a rookie mistake. It's generally recommended to standardize the data before applying PCA. Some implementations of PCA do this automatically; this one, apparently, does not.

withpca <- recipe(alcohol ~ ., data = data) %>%
  step_center(all_predictors()) %>%
  step_scale(all_predictors()) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  juice()

vif(lm(alcohol ~ ., data = withpca))

Now the VIFs look perfect.

[Screenshot: VIF output after centering and scaling, all values at or near 1]
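As an aside, if I'm reading the recipes documentation correctly, step_normalize() combines the centering and scaling steps, so the recipe above can be written a bit more compactly:

withpca <- recipe(alcohol ~ ., data = data) %>%
  step_normalize(all_predictors()) %>%   # center and scale in one step
  step_pca(all_predictors()) %>%
  prep() %>%
  juice()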

  • Centering (and in many cases, scaling) should have nothing to do with multicollinearity. – Richard Hardy Feb 01 '22 at 06:19
  • Ok, did I say that it did? Centering and scaling impact PCA. That part is clear. – curiositasisasinbutstillcuriou Feb 01 '22 at 13:28
  • You said that centering and scaling solved your problem. This is the second sentence of your answer. – Richard Hardy Feb 01 '22 at 13:43
  • I think my meaning is pretty clear--go back and read it. I said that centering and scaling clearly have an impact here on PCA. In fact, in many places (as I'm sure you know), it is recommended that you center/scale your data before performing PCA (in this particular implementation, at least)... The whole point is that PCA does in fact address multicollinearity. – curiositasisasinbutstillcuriou Feb 01 '22 at 13:59
  • I do not think we disagree on what you have written. My comment addresses the validity of that. – Richard Hardy Feb 01 '22 at 14:09
  • OK, can we focus on the VIF changes, then? I'm on here to learn something and I'm willing to listen to whatever you know about this (which is surely more than me). It seems that without prepping the data, PCA did not address multicollinearity; with the centering/scaling, PCA did address multicollinearity (per VIF). What would be your explanation of what's going on here? Or, if the problem wasn't actually addressed, what should have been done instead? – curiositasisasinbutstillcuriou Feb 01 '22 at 14:13
1

Principal components analysis is sensitive to the scaling of the variables (i.e., to variables being measured on different scales and having different variances). Scaling is recommended when computing PCs by singular value decomposition (see stats::prcomp).
Now, compute the principal components for your data with and without scaling and then inspect the loadings for PC1 ('rotation' for stats::prcomp). Without scaling, the variable with the largest variance has the largest loading. With scaling, the loadings will be quite different.
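A minimal sketch of that comparison, assuming the same data object as in the question (WINE with the rating column dropped):

pred <- data[, setdiff(names(data), "alcohol")]
pc_raw    <- prcomp(pred, scale. = FALSE)  # PCA on the covariance scale (unscaled)
pc_scaled <- prcomp(pred, scale. = TRUE)   # PCA on the correlation scale (scaled)
round(pc_raw$rotation[, "PC1"], 3)     # PC1 dominated by the largest-variance variable
round(pc_scaled$rotation[, "PC1"], 3)  # loadings spread more evenly across variables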
Next, look at the bivariate correlations between the variables and the principal components (best done with scatterplots); see the sketch below. That will show you which PCs are strongly correlated with which variables. However, for some data I am looking at, the PCs aren't strongly correlated with one another (they shouldn't be), but bivariate correlations are not necessarily reliable indicators of variance inflation. Also keep in mind that VIFs change as you drop variables.
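One way to run that check (a sketch, reusing pred and pc_scaled from above):

round(cor(pred, pc_scaled$x), 2)              # rows: original variables; columns: PCs
pairs(cbind(pred[, 1:2], pc_scaled$x[, 1:2])) # scatterplots for a few variable/PC pairs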
Finally, you need to consider whether the PCs have a meaningful interpretation for communicating the results of your model. It might be better to use the original variables and drop those with large VIFs in a stepwise manner.
I'm (obviously) not a statistician but it seems that you did make a rookie mistake in not scaling the variables for PCA.

stweb