
I found a reference in an article that goes like:

According to Tabachnick & Fidell (1996), independent variables with a bivariate correlation greater than .70 should not be included together in a multiple regression analysis.

Problem: In a multiple regression design I used 3 variables correlated at >.80, with VIFs of about 4-5 and tolerances of about .2-.3. I cannot exclude any of them, as they are the theoretically important predictors and the outcome. When I regressed the outcome on the 2 predictors that correlate at .80, both remained significant, each explained an important share of the variance, and these same two variables have the largest partial and part (semipartial) correlation coefficients among all 10 variables included (5 of them controls).

Question: Is my model valid despite high correlations? Any references greatly welcomed!


Thank you for the answers!

I did not use Tabachnick and Fidell as a guideline; I found this reference in an article dealing with high collinearity amongst predictors.

So, basically, I have too few cases for the number of predictors in the model: 13 variables for 72 cases, many of them categorical, dummy-coded control variables (age, tenure, gender, etc.). The condition index is ~29 with all the controls in and ~23 without them (5 variables).

I cannot drop any variable or use factor analysis to combine them, because theoretically they are meaningful on their own. It is too late to get more data. Since I am conducting the analysis in SPSS, perhaps it would be best to find syntax for ridge regression (although I have not done this before, and interpreting the results would be new to me).

If it matters, when I conducted stepwise regression, the same 2 highly correlated variables remained the only significant predictors of the outcome.

And I still do not understand whether the high partial correlations for each of these variables justify keeping them in the model (in case ridge regression cannot be performed).

Would you say Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (Belsley, Kuh & Welsch, 1980) would be helpful in understanding multicollinearity? Or might other references be useful?

Ander
  • For an explicit example of this situation, see the analysis of 10 IVs at http://stats.stackexchange.com/a/14528. Here, *all* the IVs are strongly correlated (around 60%). But if you excluded all of them, you wouldn't have anything left! Often it's the case that you cannot drop *any* of these variables. This makes the T&F recommendation untenable. – whuber Sep 27 '12 at 16:20
  • Indeed, there are a number of pronouncements in Tabachnick and Fidell that I'd regard as at least somewhat dubious ... just because something is printed in a book doesn't mean it always makes sense. – Glen_b Feb 26 '15 at 23:43

1 Answer


The key problem is not correlation but collinearity (see, for instance, the works of Belsley). This is best assessed using condition indexes (available in R, SAS, and probably other programs as well). Correlation is neither a necessary nor a sufficient condition for collinearity. Condition indexes over 10 (per Belsley) indicate moderate collinearity and over 30 severe collinearity, but it also depends on which variables are involved in the collinearity.
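
For readers working in R, here is a minimal sketch of computing Belsley-style condition indexes from an existing model; `fit` is a placeholder name for an `lm()` object you have already estimated, and the column scaling follows Belsley's recommendation to equilibrate the design matrix (intercept included) to unit column length before taking singular values.

```r
# Condition indexes from a fitted lm() object (here called `fit`, an assumed name)
X  <- model.matrix(fit)                        # design matrix, intercept included
Xs <- scale(X, center = FALSE,
            scale = sqrt(colSums(X^2)))        # column-equilibrate to unit length
d  <- svd(Xs)$d                                # singular values, largest first
condition_indexes <- max(d) / d                # the largest value is the condition number
round(condition_indexes, 1)
```

Dedicated functions exist as well (for example, `colldiag()` in the `perturb` package, if it is installed), and those additionally report the variance-decomposition proportions that show which variables are involved in each near-dependency.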

If you do find high collinearity, it means that your parameter estimates are unstable. That is, small changes (sometimes in the 4th significant figure) in your data can cause big changes in your parameter estimates (sometimes even reversing their sign). This is a bad thing.
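
A small toy simulation (my own illustration, not part of the original answer) makes that instability concrete: two nearly identical predictors give erratic individual slopes, and nudging a single data value can shift them noticeably even though their sum stays stable.

```r
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)   # almost perfectly collinear with x1
y  <- x1 + x2 + rnorm(n)

coef(lm(y ~ x1 + x2))            # individual slopes are erratic; their sum is near 2

x2[1] <- x2[1] + 0.05            # tiny change to one data value
coef(lm(y ~ x1 + x2))            # slopes can change markedly (sometimes even in sign)
```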

Remedies are

  1. Getting more data
  2. Dropping one variable
  3. Combining the variables (e.g. with partial least squares), and
  4. Performing ridge regression, which gives biased estimates but reduces their variance (a brief sketch follows below).
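
As a hedged sketch of remedy 4, ridge regression can be run in R with `MASS::lm.ridge`; the data frame `dat` and the variables `y`, `x1`, `x2`, `x3` below are placeholders, not anything from the question.

```r
library(MASS)

# Fit over a grid of ridge constants and ask for suggested values (GCV, HKB, L-W)
fits <- lm.ridge(y ~ x1 + x2 + x3, data = dat, lambda = seq(0, 10, by = 0.1))
select(fits)

# Refit at a chosen ridge constant and inspect the (biased, lower-variance) coefficients
coef(lm.ridge(y ~ x1 + x2 + x3, data = dat, lambda = 1))
```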
Peter Flom
  • Tabachnick and Fidell wrote a nice multivariate book for social science. They are not statisticians, but their knowledge of multivariate methods is pretty good. But I think they may create rules of thumb to simplify and could miss statistical subtleties. So I would rely more on what Peter says in his answer than in their book. – Michael R. Chernick Sep 27 '12 at 10:54
  • Thanks @MichaelChernick. I actually wrote my dissertation on collinearity diagnostics for multiple regression. – Peter Flom Sep 27 '12 at 10:56
  • I assume that you are as old as me and therefore your work came after the work of Belsley, Kuh and Welsch and Cook. I know Cook's work was mostly on other diagnostic issues (leverage and non-normality), but did he do anything on multicollinearity? Of course the concept of ridge regression even goes back before my time – Michael R. Chernick Sep 27 '12 at 13:21
  • Peter has given you even more reason to trust his answer. I did not know he was an expert in multicollinearity. – Michael R. Chernick Sep 27 '12 at 13:22
  • Hi @MichaelChernick. I am 53. Belsley wrote "the book" on collinearity. [Belsley, Kuh and Welsch](http://www.amazon.com/Regression-Diagnostics-Identifying-Influential-Collinearity/dp/0471691178/ref=sr_1_1?ie=UTF8&qid=1348752233&sr=8-1&keywords=belsley) covers collinearity, but it is a later book by just [Belsley](http://www.amazon.com/Conditioning-Diagnostics-Collinearity-Weak-Regression/dp/0471528897/ref=sr_1_9?ie=UTF8&qid=1348752233&sr=8-9&keywords=belsley) that I used for my dissertation, which I started working on in 1996 or so. I don't know if Cook did anything on collinearity or not. – Peter Flom Sep 27 '12 at 13:27
  • I am 65. So before my time goes pretty far back. I got my PhD in 1978 but actually started using statistics in my first full-time job in 1969. – Michael R. Chernick Sep 27 '12 at 15:29
  • Thank you for the answers! The condition index is ~29 with all the controls in and ~23 without them (5 variables). When I conducted stepwise regression, the same 2 highly correlated variables remained the only significant predictors of the outcome. I do not understand whether the partial correlations, which are high for each of these variables, matter as an explanation for why I have kept them in the model. Would "Regression Diagnostics: Identifying Influential Data and Sources of Collinearity" (Belsley, Kuh & Welsch, 1980) be helpful in understanding multicollinearity? – Ander Sep 28 '12 at 05:07
  • It's a relatively technical book, but it probably would be helpful. In your situation you need to use a lot fewer predictors; otherwise, results will be messy. Stepwise, by the way, is not a good method of variable selection. – Peter Flom Sep 28 '12 at 10:03
  • I know stepwise is not very good; I just used it as an extreme confirmation check to see if the same predictors remain in the model. Usually I perform hierarchical regression; this analysis required only the enter method. I performed ridge regression and the beta weights have changed: they increased for the rest of the predictors and are significant. The question is whether this is a good enough method for a report... – Ander Sep 29 '12 at 15:13
  • @Peter Flom: Why is correlation neither a necessary nor a sufficient condition for collinearity? Are you referring to non-linear correlation? – Funkwecker Oct 14 '16 at 07:27
  • It's not necessary because, if there are a large number of variables, all pairs can be only slightly correlated yet the sum of them is perfectly collinear. It's not sufficient because there are cases where fairly high correlation does not yield troublesome collinearity per condition indexes. – Peter Flom Oct 14 '16 at 11:20
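
A quick illustration of the "not necessary" direction (my own example, not from the comment thread): ten dummy variables for a ten-level factor are only weakly correlated pairwise, yet together with an intercept they are exactly collinear, because they sum to 1 in every row.

```r
set.seed(1)
g <- factor(sample(letters[1:10], 500, replace = TRUE))
D <- model.matrix(~ g - 1)              # one 0/1 column per level
max(abs(cor(D)[upper.tri(cor(D))]))     # pairwise correlations are small (roughly 0.1 in absolute value)
qr(cbind(Intercept = 1, D))$rank        # 10, not 11: an exact linear dependency
```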
  • Could you point me in the direction of resources regarding your suggestion (3), combining variables, e.g., using partial least squares? – altabq Jun 22 '20 at 16:48
  • Since I use SAS, I find their documentation particularly helpful. See [here](https://documentation.sas.com/?docsetId=statug&docsetTarget=statug_pls_syntax01.htm&docsetVersion=15.1&locale=en). It also has references at the end. – Peter Flom Jun 23 '20 at 13:04
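
For readers who would rather stay in R, here is a minimal sketch of remedy 3 with the `pls` package (assuming it is installed); the data frame `dat` and the variables `y`, `x1`, `x2`, `x3` are placeholders.

```r
library(pls)

# Partial least squares regression with cross-validation to choose the number of components
pfit <- plsr(y ~ x1 + x2 + x3, data = dat, ncomp = 2, validation = "CV")
summary(pfit)            # cross-validated RMSEP by number of components
coef(pfit, ncomp = 1)    # coefficients using only the first PLS component
```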