
I am running a multiple regression of `Y ~ a + b + c + d`, etc.

I want to do a quick check to see whether my different explanatory variables are collinear (they're a mix of categorical and continuous). There seems to be a wealth of complicated statistics behind all this -- would looking at the R-squared value of a simple regression between each pair of variables in turn (`a ~ b`, `a ~ c`, `a ~ d`, etc.) be a satisfactory coarse estimate of collinearity? Or would I have to use VIF statistics and other more complicated methods?

I hope that this is not off-topic... :-)

Edit: Also, if a simple look at the R-squared value is sufficient, what is an acceptable amount of correlation? I have two variables that are 54% correlated for example - this seems high to me...

Sarah
  • As you can learn from some of our questions sharing your tags, $R^2$ has nothing to do with collinearity among the explanatory variables. Why don't you look over the results of a search for [VIF](http://stats.stackexchange.com/search?q=VIF) and see whether they answer your questions. – whuber Mar 05 '13 at 15:50
  • Perhaps where I was getting confused was in assuming that if two explanatory variables are correlated they necessarily explain the same particulars of variation in the dependent variable (i.e. are multicollinear), when actually this isn't necessarily the case? I don't know... Perhaps my statistical knowledge is too low to use this site, haha! – Sarah Mar 05 '13 at 16:32
  • No, that is not necessarily the case, Sarah: correlation among explanatory variables might have nothing to do with how they are related to the dependent variable. This is worked out in detail in several threads; one I remember (because I posted an answer) is at http://stats.stackexchange.com/questions/28474/how-can-adding-a-2nd-iv-make-the-1st-iv-significant. – whuber Mar 05 '13 at 16:35
  • Sorry to ask another question here whuber - I don't have a high enough reputation to comment on other posts. The other posts were very useful - I just wanted to clarify: is the output I am interested in from R: `GVIF^(1/(2*Df))`? I am assuming this is the case as it takes account of the number of parameters, but I wanted to make sure... Thank you very much once again. – Sarah Mar 05 '13 at 17:51
  • Sarah, I don't know what that is the output *from*, so I'm not really sure what `GVIF` is, nor have you made it clear how you intend to interpret this. – whuber Mar 05 '13 at 18:01
  • Sorry, it's one of the column headings in the output of the `vif` function from the car package in R, where I've checked the multicollinearity for my dataset. I will scour the internet for more info rather than bombarding you further! – Sarah Mar 05 '13 at 18:03
  • Sarah - where you said "*if two explanatory variables are correlated they necessarily explain the same particulars of variation in the dependent variable*", it looks as if the response by @whuber is actually dealing with "they explain the variation in the response" when you're really saying "they explain the **same part** of the variation in the response". If they're perfectly correlated, it's necessarily the case that they explain the same part of the variation. (If they're only very highly correlated, generally the second adds little over the first.) – Glen_b Mar 05 '13 at 22:46
  • That "generally" is not fully correct, Glen: it is possible that the common part to two IVs is orthogonal to the response variable. That's why it's not such a good idea to draw conclusions about regression results based solely on assumptions about collinearity or the lack thereof. – whuber Mar 06 '13 at 16:19

1 Answer


$R^2$ does enter the VIF calculation, but it is not the $R^2$ from the model that involves $y$; it is an $R^2$ computed among the independent variables only.

Given a regression model:

$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3,$

the VIF of the first predictor, $VIF_{x_1}$, is:

$VIF_{x_1} = \frac{1}{1-R^2_{x_1}},$

where $R^2_{x_1}$ is the $R^2$ of the regression model:

$x_1 = \gamma_0 + \gamma_1 x_2 + \gamma_2 x_3.$

As you can see, if $x_1$ is highly collinear with $x_2$ (or with some combination of $x_2$ and $x_3$), then $R^2_{x_1}$ will be very high, causing the VIF to be very high as well.
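
For concreteness, here is a minimal R sketch of that calculation with simulated data (the variable names and coefficients are invented for illustration; the car package is assumed to be installed, since its `vif` function came up in the comments):

```r
# Simulated data in which x1 is correlated with x2
set.seed(1)
n  <- 200
x2 <- rnorm(n)
x3 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.5)
y  <- 1 + x1 + x2 + x3 + rnorm(n)

# R^2 of the auxiliary regression x1 ~ x2 + x3 (y is not involved)
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1

# The same quantity from the car package
library(car)
vif(lm(y ~ x1 + x2 + x3))
```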

Your method of checking all possible pairwise correlations among the predictors is close, but it does not cover the scenario in which more than two predictors are jointly collinear. For instance, if you put the percentages of energy from dietary fat, protein, and carbohydrate into the model, no single pairwise correlation will reveal the problem, but VIF will pick it up. So use your method for exploratory purposes, but be aware of its limitation.
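
A sketch of that compositional scenario, again with simulated data (the variable names and the near-constant sum are assumptions made for the example): no pairwise correlation reaches the 0.9 range, yet the VIFs flag the near-linear dependence.

```r
# Three "percent of energy" predictors that sum to roughly 100
set.seed(2)
n       <- 200
fat     <- runif(n, 20, 40)
protein <- runif(n, 10, 25)
carb    <- 100 - fat - protein + rnorm(n, sd = 1)  # near-exact sum constraint
y       <- 5 + 0.10 * fat + 0.05 * protein + 0.02 * carb + rnorm(n)

cor(cbind(fat, protein, carb))     # no pairwise |r| reaches 0.8
library(car)
vif(lm(y ~ fat + protein + carb))  # VIFs are in the tens
```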

If you are talking about just a pair of continuous predictors, their correlation usually has to be in the vicinity of $|r| > 0.9$ before the VIF exceeds 6, a conventional threshold beyond which some investigation is warranted.
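
As a quick check on that rule of thumb: with a single other continuous predictor, the auxiliary $R^2$ is just $r^2$, so the VIF can be tabulated directly from $r$ (the values below are plain arithmetic, not package output):

```r
# Two-predictor case: VIF = 1 / (1 - r^2)
r <- c(0.5, 0.7, 0.9, 0.92, 0.95)
data.frame(r = r, VIF = round(1 / (1 - r^2), 2))
# the VIF first exceeds 6 somewhere between |r| = 0.91 and 0.92
```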

Penguin_Knight
  • Note--to avoid potential misunderstandings of that first sentence--that $R^2_{x1}$ is *not* an $R^2$ involving $y$: it measures relationships among the independent variables only. (But it directly addresses the question--well done. +1) One more thing: rules of thumb for VIF thresholds, in order to be valid, ought to account for the number of IVs. The more IVs there are, the more you should be willing to tolerate higher VIFs. – whuber Mar 05 '13 at 21:30
  • @whuber Thanks for the comment. I have revised the first sentence as suggested. – Penguin_Knight Mar 05 '13 at 22:05
  • Thank you Penguin and whuber. This has clarified a great deal for me. – Sarah Mar 06 '13 at 11:26
  • So, for clarification: I was talking about the $R^2$ amongst the independent variables in my original post.