
I have run a few tests/methods on my data and am getting contradictory results.

I have a linear model: reg1 = lm(weight ~ height + age + gender + several other variables), where gender is categorical.

If I model each term linearly, i.e. with no squared or interaction terms, and run vif(reg1), 4 variables have VIFs >15. If I delete the variable with the highest VIF and re-run it, the VIFs change and now only 2 variables are >15. I repeat this until I'm left with 20 variables (out of 30), all below 10. If I use stepwise selection directly on reg1, it does not delete the highest-VIF variable. I don't understand how any of this tells me which variable is linearly dependent on which other variables, and how (and I cannot seem to find this information despite googling for ages).
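Here is a minimal sketch of what I'm doing, with dat standing in for my data frame (I'm treating all predictors as numeric for simplicity, so vif() returns a named vector - the names are placeholders):

    library(car)   # vif()

    reg1 <- lm(weight ~ ., data = dat)
    v <- vif(reg1)                       # named vector of VIFs (numeric predictors)
    worst <- names(which.max(v))         # predictor with the highest VIF
    reg1b <- update(reg1, as.formula(paste(". ~ . -", worst)))
    vif(reg1b)                           # all remaining VIFs shift after the deletion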

Furthermore, when I look at the residual plots, most appear horizontal except a few that are shaped like an inverted U (none of these have high VIFs). Does this mean a transformation is needed? (I removed outliers, leverage points, etc., but now there seem to be more!)
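This is roughly how I'm producing the plots (a sketch, with reg1 as above):

    par(mfrow = c(2, 2))
    plot(reg1)            # residuals vs fitted, Q-Q, scale-location, leverage

    library(car)
    residualPlots(reg1)   # residuals against each predictor, with a curvature
                          # test - the inverted-U pattern shows up here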

reg2 = lm(weight ~ (height + age + gender + several other variables)^2), with gender categorical as before.

If I run vif() on this model, all of the terms are >500!
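(Side note: I've read that interaction terms are almost automatically collinear with their own main effects, and that mean-centring the numeric predictors before forming interactions reduces this; a sketch with placeholder names:)

    dat_c <- dat
    num <- sapply(dat_c, is.numeric)
    num["weight"] <- FALSE                           # leave the response alone
    dat_c[num] <- scale(dat_c[num], scale = FALSE)   # mean-centre the predictors

    reg2c <- lm(weight ~ (height + age + gender)^2, data = dat_c)
    vif(reg2c)                                       # interaction-term VIFs should drop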

What else I have tried (without cutting any variables): (1) The errors seem correlated when I run diagnostics and check the Durbin-Watson statistic, indicating the model is not linear... however... (2) Box-Cox gives lambda = 1, so no transformation is needed. (3) LASSO gives the lowest Mallows' Cp on the full 30-variable model (i.e. least squares). (4) Ridge regression gives lambda = 0, which did surprise me.
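Concretely, this is how I'm running those four checks (a sketch, with reg1 as above):

    library(lmtest)   # dwtest()
    library(MASS)     # boxcox()
    library(glmnet)   # LASSO and ridge

    dwtest(reg1)                     # (1) Durbin-Watson test for correlated errors
    boxcox(reg1)                     # (2) profile likelihood for lambda;
                                     #     a peak near 1 means leave weight alone

    x <- model.matrix(reg1)[, -1]    # predictors without the intercept column
    y <- model.response(model.frame(reg1))
    cv.glmnet(x, y, alpha = 1)       # (3) LASSO, penalty chosen by cross-validation
    cv.glmnet(x, y, alpha = 0)       # (4) ridge; lambda.min near 0 means the data
                                     #     favour almost no shrinkage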

I'm getting really confused about this data. To determine a suitable model for weight, should I be looking at just linear terms, or at linear and interaction terms (remember there are 30 variables, so there are 30·29/2 = 435 pairwise interaction terms)?

When I check which terms are significant in reg2, only 12 predictors and 6 interaction terms seem significant (AIC is lowest with this combination after I run step). Should I just use this new model with the deleted variables/interaction terms and run all my tests (e.g. stepwise, LASSO, etc.) on it, or should I run them on the entire model?
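This is the selection step I ran (a sketch, with reg2 as above):

    reg2_step <- step(reg2, direction = "both", trace = FALSE)
    summary(reg2_step)       # the reduced model - here 12 main effects and
                             # 6 interactions, as described above
    AIC(reg2, reg2_step)     # compare the full and reduced interaction models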

I'm getting quite lost trying to make sense of the steps needed to find a suitable model for weight using these variables.

My final question: once I have the model, how do I test/show that it's the best (or at least a decent) model?

Any help would really be appreciated.

Dino Abraham
  • Perhaps read [this](http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection) to give yourself some perspective. If you think there's a way of examining *every* potential model to find the "best" one, you're chasing a will-o'-the-wisp. Why only look at linear relations & interactions? - do you think weight's linearly related to height? There's a lot of prior knowledge you doubtless have & could use to build a few sensible models to fit & test. – Scortchi - Reinstate Monica Feb 24 '14 at 01:13
  • Thanks for the link! I don't think it's linearly related, but I cannot find any information on how to transform the 'x' variables (imagine you have 30) - I could only find Box-Cox for the response (lambda = 1, which means a transformation is not needed). Any ideas? – Dino Abraham Feb 24 '14 at 20:16

1 Answer


Neither VIFs nor stepwise tell you what is dependent on what. For that, you want condition indices. In R you can get these from the perturb package using the colldiag function.

There, you first look at the condition indices and flag those that are high (some suggest >10, others >30). Then, for those indices, you look at the variables that contribute a large proportion of variance.

EDIT to clarify (from colldiag documentation)

    library(perturb)
    data(consumption)                                      # example data shipped with perturb
    ct1 <- with(consumption, c(NA, cons[-length(cons)]))   # lagged consumption
    m1 <- lm(cons ~ ct1 + dpi + rate + d_dpi, data = consumption)
    cd <- colldiag(m1)                                     # condition indices and
    cd                                                     # variance-decomposition proportions

Gives


    Condition
    Index        Variance Decomposition Proportions
               intercept ct1   dpi   rate  d_dpi
    1    1.000 0.001     0.000 0.000 0.000 0.002
    2    4.143 0.004     0.000 0.000 0.001 0.136
    3    7.799 0.310     0.000 0.000 0.013 0.001
    4   39.406 0.263     0.005 0.005 0.984 0.048
    5  375.614 0.421     0.995 0.995 0.001 0.814

Printing with a fuzz threshold hides the small proportions, which makes the problem rows easier to read:

    print(cd, fuzz = .3)

    Condition
    Index        Variance Decomposition Proportions
               intercept ct1   dpi   rate  d_dpi
    1    1.000  .         .     .     .     .
    2    4.143  .         .     .     .     .
    3    7.799 0.310      .     .     .     .
    4   39.406  .         .     .    0.984  .
    5  375.614 0.421     0.995 0.995  .    0.814

The first column is just an identifier. The second is the condition index. The remaining columns are the variance-decomposition proportions for each term.

The bottom row shows clearly problematic collinearity (375 is >> 30). So, which variables are contributing? ct1, dpi and d_dpi all have high variance-decomposition proportions; all three are involved. You need to do something about this.

The 4th row also has a problematic condition index (39), but only one variable (rate) contributes much, so there is not much to do there.
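As for what to do: deleting variables, getting more data, partial least squares, and ridge regression are the usual options. A minimal ridge sketch on the same model (the lambda grid here is an arbitrary choice):

    library(MASS)
    rr <- lm.ridge(cons ~ ct1 + dpi + rate + d_dpi, data = consumption,
                   lambda = seq(0, 10, by = 0.1))
    select(rr)   # GCV / HKB / L-W estimates of a reasonable ridge penalty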

Peter Flom
  • Thanks for this! I managed to run the function, but it is unclear how one variable links to another. E.g. if I know weight is collinear, how do I find out which other variables it's linked to and how they are all linked, e.g. weight = 0.5*age + 0.6*gender. – Dino Abraham Feb 24 '14 at 20:15
  • Look at the table of variance proportions for the condition indexes that are high – Peter Flom Feb 24 '14 at 20:54
  • The table where the 1st column has 1...25 (no. of variables) and then all the variables across the top? And dots everywhere and a few numbers? If it has a number, is my interpretation correct? Let's take an example: Age, Height, Arm Length, Chest Size; a 4x4 matrix. Suppose there's 0.3 in (height, arm length), 0.7 in (chest size, age) and 0.2 in (chest size, height) - (row #, column #) - I'm assuming symmetry? So height = 0.3*arm length? And chest size = 0.7*age + 0.2*height? Have I understood this correctly? – Dino Abraham Feb 25 '14 at 01:00
  • No, nothing is multiplied. This is not a regression table and it is not symmetric. Look in the row(s) that have high condition indexes for variance proportions that are also high, then look up to the top of the column. Those are the variables that are causing collinearity issues. – Peter Flom Feb 25 '14 at 10:14
  • Ah ok. What do I do with the variables causing collinearity issues if I do not know how they are linked to the other variables? This is what is confusing me. Do I delete the one with the highest VIF and repeat the test/deleting until all of them are <10? Then use step or something to cut more variables? – Dino Abraham Feb 25 '14 at 11:29
  • Not VIF, condition index. There are various things to do when you have collinearity: delete variables, use ridge regression, get more data, or use partial least squares - those are 4 that come to mind. – Peter Flom Feb 25 '14 at 13:50
  • I used the colldiag code and got a condition index table. The first column is index (I'm assuming = variables), the second is variance and the others are all the variables. I have numbers like 0.741 and 0.673 and then other small numbers like 0.001 and 0.071 in the same column. How do I interpret this please? – Dino Abraham Feb 25 '14 at 17:57
  • I'm still rather confused about how this condition index table tells me what's collinear. What exactly am I looking for? What is considered a 'large number', and how is it better than looking at VIF if it doesn't tell you what's correlated with what? This is what is confusing me quite a bit... can you shed some light on this please? – Dino Abraham Feb 25 '14 at 18:07
  • It doesn't tell you what's correlated with what (you could get that from a correlation matrix); it tells you which variables are contributing to the collinearity. Read each row separately. Look for rows with large condition indexes (above 10 or above 30, depending on who you ask). Then, in each row, look for variables that have a high proportion of contribution (some people say above .5, some above .4). Hard to show here; I will update my answer. – Peter Flom Feb 25 '14 at 21:57
  • Fantastic example - completely makes sense now! Thank you again. Just one follow-up question: say you identify the last row (like in your example) as collinear based on 2-3 other variables. What do you do? Delete the variable with 'index 5', which I'm guessing in this case is d_dpi? (Note: I am running least squares and a ridge regression separately.) – Dino Abraham Feb 26 '14 at 21:43
  • Also - if it's always just one variable >0.5 explaining a row with a condition index >30, can we just ignore multicollinearity altogether? – Dino Abraham Feb 26 '14 at 22:54
  • On your question - opinions differ, sorry. – Peter Flom Feb 27 '14 at 01:23
  • Cool, I will apply all the methods you supplied above and see what happens. Thanks again - very helpful post :) – Dino Abraham Feb 27 '14 at 08:05