7

I have a set of predictors in a linear regression, as well as three control variables. The issue here is that one of my variables of interest is only statistically significant if the control variables are included in the final model. However, the control variables themselves are not statistically significant.

Here is what the multicollinearity (variance inflation factors) of all my variables looks like, including the control variables:

 > vif(lm(return ~ EQ + EFF + SIZE + MOM + MSCR + UMP, data = as.data.frame(port.df)))
       EQ      EFF     SIZE      MOM     MSCR      UMP 
 3.687171 3.481672 2.781901 1.064312 1.438596 1.003408

 > vif(lm(return ~ EQ + MOM + MSCR, data = as.data.frame(port.df)))
       EQ      MOM     MSCR 
 1.359992 1.048142 1.412658 
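
For readers who want to see what these numbers measure: the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it with base R only. The data are simulated stand-ins (`x1`, `x2`, `x3` are hypothetical names, not the poster's variables), and `vif_manual` is an illustrative helper, not the `vif` function called above.

```r
# Simulated stand-in data; the original port.df is not available here.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # unrelated to the other two
dat <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(dat)
print(round(vifs, 3))
```

Values near 1 mean a predictor is nearly uncorrelated with the rest; larger values mean its coefficient's standard error is inflated by overlap with the other predictors.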

My variables of interest are EQ, MOM and MSCR, and the control variables are EFF, SIZE and UMP. EQ is only significant if the three control variables are included, and becomes insignificant when they are not:

  • Here are the coefficients (1st row) and t-stats (2nd row) when control variables are included (notice that EQ is statistically significant)

           intercept           EQ          EFF        SIZE         MOM       MSCR          UMP
    [1,] 0.005206246 -0.006310531 0.0001229055 0.004125551 0.007738259 0.00473377 5.838596e-06
    [2,] 1.866628909 -1.746583234 0.0388823612 1.178460997 2.145062820 2.08131100 1.994863e-01
    
  • Now, here is the result of the regression when the control variables are excluded (notice that EQ is NOT statistically significant anymore)

           intercept           EQ         MOM       MSCR
    [1,] 0.007313402 -0.002111833 0.007128606 0.00668364
    [2,] 2.652662996 -0.595391117 2.036985378 2.80177366
    

The problem is that when I include my control variables, all my variables of interest are significant, but my control variables are not.

Which variables should I include in my final model? How should I structure my final model then, given the fact that the model will be used for forecasting?

Thank you,

Mayou
  • Why are some variables "of interest" & some "control"? That is, if the point of the model's to forecast, why aren't they on an equal footing? Will some not be available for forecasting purposes? – Scortchi - Reinstate Monica Sep 06 '13 at 14:31
  • They will all be available for forecasting purposes. Should I then consider them all "of interest"? If that's the case, how would I deal with the fact that 3 of them are insignificant, but their presence affects the significance of another variable? – Mayou Sep 06 '13 at 14:33
  • While there may be a distinction between control and other variables in your mind, it is vital to realise that neither the statistics nor the software pays any attention to that distinction. – Nick Cox Sep 06 '13 at 14:39
  • True that. The question remains that even assuming all of these variables are "of interest", how should I structure my final model given the statistical significance shown above (that depends on which variables are included)? – Mayou Sep 06 '13 at 14:43
  • @Scortchi The other reason why I set those variables to be "control", is that they seem to mitigate the confounding effect for **EQ**: EQ becomes significant when the controls are included. Still I am confused as to how to deal with their insignificance, if I am going to use the model for forecasting purposes? In other words, how can I explain their presence in the model when I use them to forecast returns? – Mayou Sep 06 '13 at 14:45
  • Then I can't see the point of making distinctions between them. Your question is really about the best way to do [tag:model-selection]. The worst way (that ever gets seriously proposed) is to remove all "insignificant" variables in one go just because they're "insignificant". Often a pretty good way is not doing it at all. – Scortchi - Reinstate Monica Sep 06 '13 at 14:54
  • Well, the model-selection part was taken care of using LASSO. All the variables you see above are the ones that were selected using LASSO. However, when tested for significance, some were significant, and some were not. That's where I got confused as to which I should keep in my final model. What are your thoughts? – Mayou Sep 06 '13 at 14:56
  • You presumably already cross-validated the model to choose the value of the shrinkage parameter. What makes you think the model's still overfit? – Scortchi - Reinstate Monica Sep 06 '13 at 15:07
  • I am not thinking there is overfitting, I am essentially concerned about the "predictive" power of the chosen variables. LASSO tries to maximize the fit, that doesn't mean that it chooses variables that are statistically significant. – Mayou Sep 06 '13 at 15:09
  • Just read up on LASSO again & forget about "significance". This is like a doctor asking how many leeches to apply after a course of antibiotics. – Scortchi - Reinstate Monica Sep 06 '13 at 15:14
  • That makes sense. Thank you @Scortchi. I have a follow-up question on LASSO. Would you be available to discuss briefly on chat? Thanks! – Mayou Sep 06 '13 at 15:37
  • @Scortchi Good choice of words, but did you mean "like asking a doctor how many leeches to apply..."? – Nick Cox Sep 06 '13 at 22:51
  • @Nick: I was imagining an old-fashioned doctor who's just been introduced to the new technique asking advice from his colleagues, but it would probably be clearer the way you put it. – Scortchi - Reinstate Monica Sep 07 '13 at 11:35
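
The comments above argue that, for a forecasting model, out-of-sample error matters more than the significance of individual coefficients. A minimal sketch of that comparison, assuming simulated data and a hypothetical helper `cv_rmse` (neither comes from the thread), is k-fold cross-validation of two candidate `lm` specifications on the same folds:

```r
# Simulated stand-in data: x2 has only a weak effect and may not test as "significant".
set.seed(123)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 + 0.1 * dat$x2 + rnorm(n)

# Assign each row to one of k folds once, so both models see identical splits.
k <- 5
folds <- sample(rep(seq_len(k), length.out = n))

# Hypothetical helper: cross-validated root mean squared forecast error.
cv_rmse <- function(formula, data, folds) {
  response <- all.vars(formula)[1]
  sq_err <- unlist(lapply(sort(unique(folds)), function(i) {
    fit  <- lm(formula, data = data[folds != i, ])       # fit on the other folds
    pred <- predict(fit, newdata = data[folds == i, ])   # forecast the held-out fold
    (data[[response]][folds == i] - pred)^2
  }))
  sqrt(mean(sq_err))
}

rmse_small <- cv_rmse(y ~ x1, dat, folds)
rmse_full  <- cv_rmse(y ~ x1 + x2, dat, folds)
print(c(without_x2 = rmse_small, with_x2 = rmse_full))
```

Whichever specification gives the lower held-out error is the better forecaster on this criterion, regardless of which coefficients clear a significance threshold.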

2 Answers

12

One reason to include control variables is precisely because they can affect other variables. In this case, the statistical significance of the control variable is completely irrelevant.
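
A small simulation can illustrate the mechanism. This is only a sketch under stated assumptions: `x`, `z` and `y` are made-up variables, and the coefficients are chosen so that the control `z` offsets the marginal association between `x` and `y`. It shows how including a correlated control can change the t-statistic of the predictor of interest; it does not reproduce the exact pattern in the question, where the controls themselves are insignificant.

```r
# Hypothetical data: z is a control correlated with the predictor of interest x.
set.seed(42)
n <- 300
z <- rnorm(n)
x <- z + rnorm(n, sd = 0.5)
# x has a true effect of 1, but z's effect (-1.25) cancels x's marginal
# covariance with y, so y and x look unrelated when z is left out.
y <- x - 1.25 * z + rnorm(n)

fit_without <- lm(y ~ x)
fit_with    <- lm(y ~ x + z)

t_without <- coef(summary(fit_without))["x", "t value"]
t_with    <- coef(summary(fit_with))["x", "t value"]
print(c(without_control = t_without, with_control = t_with))
```

Here the control is doing its job by removing variation in `y` that would otherwise mask the effect of `x`, which is the sense in which its own significance is beside the point.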

However, you may run into journal editors who disagree.

Peter Flom
  • Thank you for your comment! So how would you interpret the presence of the control variables? Also, does it make sense to use the model with insignificant control variables to forecast the response variable *return*? – Mayou Sep 06 '13 at 14:30
  • Interpreting the model is up to you. :-). I don't know what any of the variables actually are, and you are the one who knows the substantive area. The control variables, statistically, are there because you want to, well.... control for them! – Peter Flom Sep 06 '13 at 14:43
  • @PeterFlom Thank you for your answer. It is a great help! Could you also advise what to do if one runs into such journal editors? – PSY Dec 10 '21 at 20:22
2

Just a short comment: your p-values should reflect the number of models you are "trying out". In some ways your approach of trying models with and without subsets of variables is one aspect of p-hacking. Your research question alone (not the data) should determine what is a control variable and what is a variable of interest. Exploratory data analysis is fine as long as you report on all tests that you did.
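
To make the point about "trying out" models concrete, here is a rough simulation sketch. Everything in it is an assumption for illustration: the data are pure noise, so `x` has no true effect on `y`, and `c1`, `c2` are hypothetical controls. It compares the false-positive rate of one pre-specified model with the rate obtained by reporting whichever of four specifications gives `x` the smallest p-value.

```r
set.seed(2013)
n_sims <- 500   # number of simulated data sets
n      <- 60    # observations per data set

# Four specifications that differ only in which controls are included.
specs <- list(y ~ x, y ~ x + c1, y ~ x + c2, y ~ x + c1 + c2)

# p-value of the coefficient on x in a given specification.
p_for_x <- function(formula, data) {
  coef(summary(lm(formula, data = data)))["x", "Pr(>|t|)"]
}

# One row per simulated data set, one column per specification.
p_mat <- t(replicate(n_sims, {
  d <- data.frame(y = rnorm(n), x = rnorm(n), c1 = rnorm(n), c2 = rnorm(n))
  sapply(specs, p_for_x, data = d)
}))

rate_fixed  <- mean(p_mat[, 1] < 0.05)               # always report y ~ x
rate_picked <- mean(apply(p_mat, 1, min) < 0.05)     # report the "best" of the four
print(c(prespecified = rate_fixed, best_of_four = rate_picked))
```

Because the smallest of several p-values can never exceed any single one of them, the second rate is at least as large as the first; how much larger depends on how strongly the specifications differ.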

jank