
I used Bayesian model averaging (BMA) and gamboostLSS for variable selection on real data with 23 covariates. Performance was assessed with 10-fold CV. Out of curiosity I also tested stepwise regression and found that it predicted almost as well as the other two methods. I know stepwise regression has a lot of shortcomings, so I actually expected it to do much worse. I'm not sure how to interpret this. Could there be something wrong with my code? Is it reasonable to expect stepwise regression to do worse than BMA and gamboostLSS, which are much more solid methods?

My data set is relatively small: 660 observations over 11 years.

Any thoughts on the matter?

Herzriesig
  • What is your dataset? Perhaps it has some features that intrinsically favour stepwise? For each method there are theoretical conditions virtually guaranteeing good or bad performance. Probably your type of problem just happens to be stepwise-friendly? – Richard Hardy Aug 20 '16 at 09:10
  • @RichardHardy I've thought about this too, but that shouldn't be the case, since my DV is the Gini coefficient, which lies on the closed interval from 0 to 1. That's why I'm using a beta regression with gamboostLSS. My data set is multiply imputed, so I'm working with 12 datasets (the DV had no missing values). – Herzriesig Aug 20 '16 at 10:15
  • The main thing is probably the correlation structure of the regressors rather than the data range. Could you post the correlation matrix of the regressors (rounded to 2 decimal places to avoid clutter)? – Richard Hardy Aug 20 '16 at 10:38
  • @RichardHardy Here's a link to an image of the correlation matrix: [link](http://imgur.com/a/lSJvQ). I applied a logistic transformation to the DV, as I had initially forgotten to do so. As a consequence, the performance of the stepwise regression dropped drastically. Now I'm starting to doubt my code somewhat, or maybe this is just a good correction displaying more accurate results. What do you think? – Herzriesig Aug 20 '16 at 13:29
  • Hey, did you make sure to include the variable selection step **in each fold** of your cross-validation? Apologies if this sounds obvious to you, but I've met people who have been making this mistake again and again, and it leads to a badly underestimated CV error for the stepwise regression. – DeltaIV Aug 22 '16 at 09:26
  • @DeltaIV Yea, I believe so. The variable selection procedure was looped 10 times. It's absolutely something to make sure of, so no apologies needed. – Herzriesig Aug 22 '16 at 15:33
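
To make the point in the last two comments concrete, here is a minimal R sketch of 10-fold CV that re-runs the stepwise selection inside every fold (assuming a hypothetical data frame `dat` with response `y` and the 23 covariates; the OP's actual code is not shown):

```r
library(MASS)  # for stepAIC

set.seed(1)
k <- 10
folds <- sample(rep(1:k, length.out = nrow(dat)))  # random fold assignment
cv_mse <- numeric(k)

for (i in 1:k) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]

  # Variable selection is done on the training fold only;
  # doing it once on the full data would leak information into the test fold.
  full <- lm(y ~ ., data = train)
  sel  <- stepAIC(full, direction = "both", trace = FALSE)

  pred <- predict(sel, newdata = test)
  cv_mse[i] <- mean((test$y - pred)^2)
}

mean(cv_mse)  # CV estimate of the prediction error
```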

2 Answers


I'll skip the question of possible errors in your code, because you haven't shown the code. Concerning the predictive performance of stepwise regression, you're not the first to find that it performs better than one would expect in a real-world application: see here. However, there are good theoretical reasons why stepwise regression should tend to overfit the training data set, resulting in poor predictive performance. Thus the fact that it does a good job on some specific data sets may well be due to chance. Have you addressed the possibly high variance of the cross-validation estimator by using repeated cross-validation? Have you tried other ways to estimate the generalization error, for example the bootstrap?
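
A hedged sketch of both suggestions using caret, whose `"lmStepAIC"` method wraps `MASS::stepAIC` and repeats the selection inside every resample (`dat` and `y` are hypothetical placeholders, not the OP's objects):

```r
library(caret)

set.seed(1)

# Repeated 10-fold CV: the 20 repeats reduce the variance of the CV estimate
rep_cv <- trainControl(method = "repeatedcv", number = 10, repeats = 20)
fit_cv <- train(y ~ ., data = dat, method = "lmStepAIC",
                trControl = rep_cv, trace = FALSE)

# Bootstrap estimate of the generalization error
boot_ctrl <- trainControl(method = "boot", number = 200)
fit_boot  <- train(y ~ ., data = dat, method = "lmStepAIC",
                   trControl = boot_ctrl, trace = FALSE)

fit_cv$results    # RMSE plus its resampling standard deviation
fit_boot$results
```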

Even if the predictive performance happens to be good on a specific data set, I don't see why one would want to use stepwise regression today, when we have the LASSO as a great way to estimate a sparse linear regression model (note that gamboostLSS doesn't estimate a linear regression model, but a GAMLSS, which is a much more complicated model). You cannot reliably make inferences with stepwise regression, because the p-values don't have their standard interpretation. You could of course compute perfectly valid p-values for stepwise regression using sample splitting, but you would lose power that way. Instead, there is a significance test for the LASSO that uses all the data.
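
For illustration, a minimal LASSO sketch with glmnet, assuming a hypothetical numeric predictor matrix `x` and response `y` (for the post-selection inference mentioned above, the selectiveInference package is one implementation):

```r
library(glmnet)

set.seed(1)
# alpha = 1 is the LASSO penalty; cv.glmnet chooses lambda by 10-fold CV
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)

coef(cv_fit, s = "lambda.1se")                 # sparse coefficients: selection + shrinkage in one step
pred <- predict(cv_fit, newx = x, s = "lambda.1se")
```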

DeltaIV
  • I suppose hypothesis testing is not of much interest for the OP, but it's no problem to mention it, of course. Also, suggesting yet another method (LASSO) is tangential to the question, even though it could serve the OP well in general. Also, the main question is **why** stepwise performs as well as BMA and boosting. Could you answer that one directly? Or is the only answer that it is due to chance (which you seem to offer as an explanation)? – Richard Hardy Aug 20 '16 at 09:04
  • @DeltaIV Thanks for the suggestion. I'll definitely look into the possibility of using bootstrapping. BMA and gamboostLSS are quite expensive methods, so I haven't tried LOOCV, which would undoubtedly be interesting. If you were to report these results, would you say that the similar performance of stepwise and the two other methods is most likely due to chance? What's additionally baffling is that the DV is specified as a percentage, which makes standard regression inappropriate and makes the expected poor performance of stepwise even more likely. I'm using beta regression with gamboostLSS. – Herzriesig Aug 20 '16 at 10:40
  • @RichardHardy OK, LASSO is another estimation method, but not necessarily another **model**. Initially the OP didn't explain that s/he was trying to predict the Gini coefficient (a bounded continuous response), so I thought s/he was using BMA & stepwise regression to estimate linear models, and gamboostLSS to estimate **nonlinear additive** models. – DeltaIV Aug 21 '16 at 10:47
  • @RichardHardy Concerning chance, I mean that for some particular data generating processes it could happen that stepwise does a good job at predicting out of sample. In general however [it has many drawbacks](http://stats.stackexchange.com/a/20856/58675), so why use it when you have a better alternative to estimate the same model? – DeltaIV Aug 21 '16 at 10:51
  • @DeltaIV: Don't get me wrong, I appreciate your answer and your remarks. You are telling a lot of good things there. My point was, can you answer the question directly? I am genuinely curious **why** stepwise did so well this time, which also happens to be the main question in the OP. Your main guess appears to be, due to chance. That is a fine guess, and I am just wondering if you could do even better than that (e.g. construct an example showing when that would happen by design). – Richard Hardy Aug 21 '16 at 10:54
  • @RichardHardy Ah, OK, I get your point now. It would be easy to build an example where stepwise always does a perfect job on the training set and a horrible one on the test set :) but of course that's not what we need here. I'll think about it in the next few days. – DeltaIV Aug 21 '16 at 12:19
  • @DeltaIV, sounds good. – Richard Hardy Aug 21 '16 at 12:21
  • @Nesvold, don't do LOOCV. LOOCV has lower bias but higher variance than 10-fold cross-validation, so it might actually give less accurate estimates of the generalization error than k-fold CV. I was suggesting repeated 10-fold (or maybe 5-fold, though I don't expect the results to change much) cross-validation. Did you read my link to Peter Ellis' blog? I think he gives a fine example of both repeated k-fold CV and the bootstrap applied to stepwise regression, and he also provides R code. – DeltaIV Aug 22 '16 at 09:34
  • @DeltaIV Yeah, I had a quick look at the article. It was quite interesting, thanks for the link. I've read some other articles that recommend k-fold CV particularly for model selection. Apparently, LOOCV should be applied when the task is to minimize risk. I wrote my own CV function, so I'm not using caret in this case. Maybe the results reflect something wrong in the code. I've tried both 5-fold and 10-fold, and your expectation was right: the results didn't change much. – Herzriesig Aug 22 '16 at 16:04

First, if there exists a decent predictive linear model to be found among your variables, then it's unsurprising that a single selected model could do about as well as model averaging or boosting. This seems to be the case for your data.

Next, if predictors in your dataset don't have too many large correlations among one another, stepwise regression has a high chance of finding a close-to-best predictive model. (You shouldn't trust the p-values for hypothesis testing etc., but for pure prediction it may be fine.) In the correlation matrix you posted, it seems you have mostly small correlations, so it's unsurprising that a stepwise model's performance would be about as good as any other.
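
One quick way to check that premise, assuming a hypothetical numeric design matrix `X` of the regressors (not the OP's data):

```r
cmat <- cor(X)
round(cmat, 2)                    # the kind of matrix the OP posted
# If the largest absolute pairwise correlation is modest (say below 0.5),
# collinearity is unlikely to lead stepwise badly astray.
max(abs(cmat[upper.tri(cmat)]))
```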

Finally, cross-validation isn't ideal for model selection if you use a large training:testing ratio, as with 10-fold. If you really want to be sure you chose the "best" model, you'd have to train on smaller splits and test on larger ones, so that the difference between the best and near-best models' performances is measured more precisely. With 10-fold CV, it's unsurprising that all "decent" models show similar estimated performance.

In short, I see nothing wrong with stepwise doing about as well as model averaging or boosting in your case.

civilstat