4

I have a numerical dependent variable and many independent variables. Most of my independent variables are dummy variables, but I have some categorical and numerical variables too. I tried forward and backward model selection in R, but the selection returns an empty model (the null model with no predictors)! Yet when I run separate simple regressions, there seems to be a significant relationship between my independent variables and the dependent variable!

My question is: Am I going to have biased results if I run separate simple regressions with each variable, and then run a multiple regression with all the significant variables?
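
For reference, this is roughly what I ran; the data and variable names below are just placeholders, not my real data:

```r
set.seed(1)
n   <- 100
dat <- data.frame(
  y  = rnorm(n),
  x1 = rnorm(n),                                            # numerical
  x2 = factor(sample(c("a", "b", "c"), n, replace = TRUE)), # categorical
  d1 = rbinom(n, 1, 0.5),                                   # dummy
  d2 = rbinom(n, 1, 0.5)                                    # dummy
)

full <- lm(y ~ ., data = dat)   # multiple regression with all predictors
null <- lm(y ~ 1, data = dat)   # intercept-only model

# Forward and backward stepwise selection (AIC-based by default)
fwd <- step(null, scope = formula(full), direction = "forward")
bwd <- step(full, direction = "backward")

# The separate simple regressions I compared against
summary(lm(y ~ x1, data = dat))
summary(lm(y ~ d1, data = dat))
```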

gung - Reinstate Monica
FnewatR
  • 3
    Yes, that will lead to bias. Why do you need to select variables? People often seem to assume that this is just required for some reason, but it's not clear that it ever is. Second, what do you mean by "R returns my model empty"? Are you saying that the null model (no predictors) is selected? – gung - Reinstate Monica Jun 02 '20 at 17:00
  • Because some of the variables are not significant in predicting the dependent variable! And yes, the model with no predictors is selected. – FnewatR Jun 02 '20 at 17:13
  • 2
    Who cares if some of the variables in the model are not significant? No harm will come to you, or your model, if it includes some non-significant variables. – gung - Reinstate Monica Jun 02 '20 at 17:23
  • I understand that. But I want to remove those non-significant variables from my model. Forward and backward selection hasn't helped me with finding out which variables are actually significant. I run multiple regression with fewer variables and I get significant results. So, I'm convinced stepwise is not doing justice to my variables. – FnewatR Jun 02 '20 at 17:30
  • 2
    There is no need to remove those non-significant variables from your model. Doing so harms your model, whereas leaving them in does not. Forward & backward selection, simply put, *cannot* help with finding out which variables are actually significant. What variables are significant or not is what is reported in the original full model. – gung - Reinstate Monica Jun 02 '20 at 18:03
  • Thanks for your answer. This was enlightening. – FnewatR Jun 02 '20 at 18:04
  • You're welcome. If it helps you, I can write it up as an 'official' answer here. – gung - Reinstate Monica Jun 02 '20 at 18:05
  • Read from the master: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/ – Dave Jun 02 '20 at 18:05
  • 1
    @gung-ReinstateMonica, certainly will help others having the same issue. Do that please. Thank you. – FnewatR Jun 02 '20 at 18:07
  • @Dave thank you, I will read this. – FnewatR Jun 02 '20 at 18:08

2 Answers

7

Yes, that will lead to bias.

The question is: Why do you need to select variables in the first place? People often seem to assume that this is just required for some reason, but it's not clear that it ever is. It may well be the case that there are some variables in the original (full) model that are not significant. But this is just fine. There is no problem if your model includes some non-significant variables. There is no need to remove those non-significant variables from the model. Doing so harms your model, whereas leaving them in does not. Forward and backward selection, simply put, cannot help with finding out which variables are 'actually significant'. Which variables are significant or not is what is reported in the original full model.
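
To make the bias concrete, here is a small simulation sketch (the sample sizes and the 0.05 cutoff are just illustrative assumptions): the response and all ten predictors are pure noise, so every "significant" result the screen-then-refit procedure reports is a false positive.

```r
# Simulation sketch: y and all predictors are pure noise, so there are no
# true effects; anything flagged as "significant" is spurious.
set.seed(42)

one_run <- function(n = 50, p = 10) {
  X   <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
  dat <- data.frame(y = rnorm(n), X)
  preds <- colnames(X)

  # Step 1: screen with separate simple regressions
  p_simple <- sapply(preds, function(v)
    summary(lm(reformulate(v, "y"), data = dat))$coefficients[2, 4])
  keep <- preds[p_simple < 0.05]
  if (length(keep) == 0) return(c(kept = 0, sig_in_refit = 0))

  # Step 2: refit a multiple regression on only the screened predictors
  refit   <- lm(reformulate(keep, "y"), data = dat)
  p_refit <- summary(refit)$coefficients[-1, 4]
  c(kept = length(keep), sig_in_refit = sum(p_refit < 0.05))
}

res <- replicate(500, one_run())

mean(res["kept", ] > 0)                           # runs where the screen keeps anything
sum(res["sig_in_refit", ]) / sum(res["kept", ])   # kept predictors still "significant"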
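Over the runs you should see the screen keep at least one predictor in a large fraction of runs (roughly 1 - 0.95^10, about 40%), and most of the kept predictors typically remain "significant" in the refitted model, even though no true effects exist. The p-values in the second-stage model are conditioned on having passed the screen, so they cannot be taken at face value.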

gung - Reinstate Monica
  • 1
    To be fair, "selective inference" is a valid field of study and could be used to report significance of variables in a selected model, depending on how selection is performed. It will not be the case, however, that all selected variables are significant! – steveo'america Jun 02 '20 at 18:17
  • That's reasonable, @steveo'america. – gung - Reinstate Monica Jun 02 '20 at 18:23
  • 1
    Wouldn't we expect removing non-significant variables to improve generalization error by reducing overfitting to non-informative features? – Ryan Volpi Jun 02 '20 at 20:17
  • @RyanVolpi, no. We expect removing non-significant variables to *increase* generalization error by *inducing* overfitting. It may help you to read my answer to: [Algorithms for automatic model selection](https://stats.stackexchange.com/a/20856/7290) – gung - Reinstate Monica Jun 02 '20 at 20:32
  • @gung-ReinstateMonica So in all practical cases it is better to include non-informative features than to implement any method of identifying and eliminating them? – Ryan Volpi Jun 02 '20 at 22:07
  • 1
    @RyanVolpi, I wouldn't go that far. Understand the phenomenon you're modeling. Use your knowledge to pick variables a-priori that are likely to be worthwhile. At that point, if some aren't significant, whatever. – gung - Reinstate Monica Jun 03 '20 at 00:53
  • 1
    Excellent answer, but ... it *sometimes* is. Two reasons that leap to mind are collinearity and overfitting. But significance is not a good reason. – Peter Flom Jun 05 '20 at 15:45
2

Adding to Gung's excellent answer:

Some reasons that you might need (or want) to eliminate variables

  • Collinearity (although there are other solutions to that problem)
  • Overfitting -- if you don't have enough data. There are various rules of thumb; a common one is that you need 10 observations for every independent variable. (A quick check of both issues is sketched just below this list.)
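
A minimal sketch of checking both issues in R, using the built-in mtcars data purely as a stand-in for your own:

```r
library(car)   # for vif(); not part of base R

fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Collinearity: variance inflation factors; values much above roughly 5-10
# are commonly treated as a warning sign
vif(fit)

# Overfitting rule of thumb: observations per predictor
nobs(fit) / (length(coef(fit)) - 1)
```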

Some specific reasons for keeping nonsignificant variables (beyond what Gung listed)

  • A small effect can be interesting. Sometimes theory predicts a large effect and you find a small one. E.g., if you find a tribe of people where men and women are the same height, then sex will show up as a nonsignificant variable, but that is very interesting!
  • It's involved in an interaction. There are very few cases where you want to include an interaction but not the main effects.
  • It mediates an effect
  • It is the main variable you are interested in
Peter Flom
  • 1
    thank you very much for this! – FnewatR Jun 06 '20 at 16:18
  • 2
    I actually do think I'm dealing with both overfitting and collinearity here. On collinearity: I examined the variance inflation factors of the model and some of the predictors had high VIF. Beyond that, I have only around 365 observations but about 40 predictors. My actual problem is that I need to know the contribution of each variable to the adjusted R2 value of my model. I found out that if my predictors weren't collinear I could just run individual simple regressions and see what share of the variance is explained by a specific variable. Mine are collinear, though. – FnewatR Jun 06 '20 at 16:27
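
One possible way to get per-variable shares of explained variance despite collinearity is an averaging-over-orderings (LMG) decomposition; a minimal sketch with the relaimpo package, again using mtcars purely as a stand-in for the real data:

```r
library(relaimpo)   # install.packages("relaimpo") if needed

fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# LMG decomposition: averages each predictor's contribution to R^2 over all
# orderings of entry, so it remains meaningful when predictors are correlated
calc.relimp(fit, type = "lmg", rela = TRUE)   # relative shares sum to 1
```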