
I have a dataset of 150 observations and I want to fit a linear model to try to explain relations. I use backwards regression. When I keep all 10 variables, I have an adjusted R² value of 0.8076. When I remove the variables with p-values higher than 5%, in order from highest to lowest, the adjusted R² value goes up to 0.8189. This leaves me with 6 variables, which is still quite a lot. Adding the $\log_{10}$ of one of the variables also improves the adjusted R² value ever so slightly to 0.8259, and doing this lets me remove a different variable since its p-value has risen to 16%. Removing more variables drops the adjusted R² value to $\approx$ 0.79xx, depending on which variable I remove.

My question is, when should I stop removing variables? As soon as I see the adjusted R²-value drop or should I wait until my adjusted R² value drops below a certain value? Or should I look at something else?

Thanks in advance!

Fooourier
  • @EdM not really, I'm using backwards regression. All my variables have a significant p-value (<5%). Mostly around $10^{-3}$ and my adj. R² value stays about the same, but drops slightly ($\approx 1.5$%) when I remove a variable. I don't really know if this is small enough to ignore or slightly too big to ignore. – Fooourier May 20 '21 at 21:43
  • Gelman is opposed to stepwise regression: https://statmodeling.stat.columbia.edu/2014/06/02/hate-stepwise-regression/. This Stack tends to side with him. – Dave May 20 '21 at 21:45
  • Please do not use any kind of stepwise procedure. There are so many questions, with answers, on this site, and if you do a google search you will find even more resources. There is hardly any debate on this issue. Stepwise should be stopped. – Robert Long May 20 '21 at 21:57
  • @RobertLong tell that to my professor xD, I have to learn what he teaches me. – Fooourier May 20 '21 at 21:59
  • @Fooouriers Sure, please provide his contact details and I will. The *only* reason for teaching stepwise should be to show how bad it is. – Robert Long May 20 '21 at 22:02
  • @RobertLong My comment before yours should show us to be in agreement, but I would say that there is hardly any debate on this issue among trained *statisticians*. Users of statistics seem to be big fans. I remember someone posting on here that they reviewed for a journal and wrote something like, "Stepwise procedures are invalid," and got back a response from the author, "This is a standard technique in our field [references]." – Dave May 20 '21 at 22:06
  • @Dave That's a very good point :) It's an endemic problem in applied work... – Robert Long May 20 '21 at 22:19
  • When to stop: before starting – Firebug May 20 '21 at 22:39
  • @Firebug Bravo – Robert Long May 20 '21 at 22:44

1 Answer


R² just tells you how well your model fits the data you feed it. It does not tell you whether your model makes sense (is explanatory) or whether it is predictive (meaning it really works on new data). A high R² is often the result of a model that is heavily overfit, with far too many variables, including weird variable transformations. But the minute you test it on new data, the model performs poorly.

Given that, check out a few things: a) Do the signs of the regression coefficients of your variables support the underlying logic of the model? If not, you should eliminate that variable, because it injects really bad noise into your model.

b) Does your model predict well on new data? Hold-out, out-of-sample, cross-validation testing... whatever method of testing on new data you want to use is fine.
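To make (b) concrete, here is a minimal sketch with scikit-learn that scores a full model against a reduced one by cross-validated R² instead of in-sample R². The data and the choice of reduced variable set are made up for illustration; substitute your own:

```python
# Compare a full model (10 predictors, mostly noise here) against a
# reduced one (the two predictors that actually carry signal in this
# hypothetical setup) using 5-fold cross-validated R^2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n = 150
X = rng.normal(size=(n, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)

reduced_cols = [0, 1]                  # hypothetical reduced variable set

full_r2 = cross_val_score(LinearRegression(), X, y,
                          cv=5, scoring="r2").mean()
reduced_r2 = cross_val_score(LinearRegression(), X[:, reduced_cols], y,
                             cv=5, scoring="r2").mean()
print(f"full model CV R^2:    {full_r2:.3f}")
print(f"reduced model CV R^2: {reduced_r2:.3f}")
```

The cross-validated score penalizes the noise variables automatically, so you can compare candidate models without relying on in-sample adjusted R² at all.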

Here is what I would do: use as a benchmark the model with the fewest variables that gives you a pretty decent R², where you like all the variables from an explanatory standpoint, including directional signs that make sense. Then test any other model with more variables on an out-of-sample basis, using any of the techniques mentioned above.

You will often find that the simpler benchmark model with fewer variables performs better in out-of-sample testing. And that is what matters.

Sympa