
I am trying to convince myself that stepwise selection should not be used, since we often model data this way at my work. I recently bought Frank Harrell's very interesting book (Regression Modeling Strategies). In section 4.3, Variable selection, he states the following:

But using $Y$ to compute $P$-values to decide which variables to include is similar to using $Y$ to decide how to pool treatments in a five-treatment randomized trial, and then testing for global treatment differences using fewer than four degrees of freedom.

He gave a similar explanation in a post here on Cross Validated, but I do not understand either example (pooling the treatments and then testing for global differences).

I understand that there is a multiple-testing problem, but I would like a more technical proof or more details regarding these examples.

Fed
  • Not-really-serious comment: [stepwise selection](http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856) is bad, [p-values](http://stats.stackexchange.com/questions/200500/asa-discusses-limitations-of-p-values-what-are-the-alternatives) are bad, so taken together they are doubly bad ;) – Tim Feb 05 '17 at 08:36
  • "He gave a similar explanation in a post here on CrossValidated" - link please. – Ben Feb 11 '19 at 22:59

1 Answer


For what it may be worth, here is my explanation.

One reason stepwise selection is a bad procedure is that at each step the model is fitted by ordinary least squares, i.e. unconstrained. If you are planning to do feature selection, you are usually in a scenario where $p \gg n$. To find $\hat\beta$, OLS has to invert the matrix $X^T X$, which is not invertible in that case. So you should prefer a method like the lasso, which is constrained OLS.
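As a minimal sketch of that point (my own illustration, not from the question; the dimensions and penalty are arbitrary and it assumes numpy and scikit-learn): with far more predictors than observations, $X^T X$ is rank deficient, while the lasso still returns a sparse, well-defined fit.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 50, 200                      # far more predictors than observations
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                      # only 5 truly nonzero coefficients
y = X @ beta + rng.standard_normal(n)

# X'X is p x p but has rank at most n < p, so it cannot be inverted.
XtX = X.T @ X
print("rank of X'X:", np.linalg.matrix_rank(XtX), "out of", p)

# The lasso (OLS with an L1 penalty) is still well defined and sparse.
lasso = Lasso(alpha=0.1).fit(X, y)
print("nonzero lasso coefficients:", np.sum(lasso.coef_ != 0))
```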

A second reason: the stepwise procedure is suboptimal by construction. Each variable is added or dropped greedily, so the algorithm cannot know whether it has found the globally best subset or only a local optimum.
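Here is a small synthetic illustration of the greediness issue (my own sketch; the "composite" predictor is constructed just for this example). Forward selection scores one variable at a time, so it tends to latch onto the composite predictor first, while an exhaustive search over all subsets of the same size can recover the true pair; in general the two need not agree.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
n, k = 200, 2
x0 = rng.standard_normal(n)
x1 = rng.standard_normal(n)
x2 = x0 + x1 + 0.3 * rng.standard_normal(n)          # composite predictor
X = np.column_stack([x0, x1, x2, rng.standard_normal((n, 5))])
y = x0 + x1 + 0.5 * rng.standard_normal(n)            # truth uses x0 and x1 jointly
p = X.shape[1]

def rss(cols):
    """Residual sum of squares of an OLS fit on the given columns."""
    Z = X[:, list(cols)]
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ coef) ** 2)

# Greedy forward selection: add the single best variable at each step.
selected = []
for _ in range(k):
    best = min((j for j in range(p) if j not in selected),
               key=lambda j: rss(selected + [j]))
    selected.append(best)

# Exhaustive search over every subset of size k.
best_subset = min(combinations(range(p), k), key=rss)
print("greedy pick:", sorted(selected))
print("best subset:", sorted(best_subset))
```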

I'd add that there is a general problem with feature selection: people forget that if you use your data twice, first to perform feature selection and then to carry out inference on the same data, you introduce a substantial bias in your estimates. Read this: http://www.maths.bath.ac.uk/~jjf23/papers/interface98.pdf
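A minimal simulation of that bias (my own sketch, assuming a response that is pure noise): if you first pick the predictor that looks best and then test it on the same data, the naive p-value rejects far more often than the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p, reps = 50, 20, 2000
false_rejections = 0
for _ in range(reps):
    X = rng.standard_normal((n, p))
    y = rng.standard_normal(n)              # independent of every column of X
    # Step 1: selection -- keep the predictor most correlated with y.
    j = np.argmax(np.abs(np.corrcoef(X.T, y)[-1, :-1]))
    # Step 2: inference on the same data -- simple regression t-test.
    r = np.corrcoef(X[:, j], y)[0, 1]
    t = r * np.sqrt((n - 2) / (1 - r**2))
    pval = 2 * stats.t.sf(abs(t), df=n - 2)
    false_rejections += pval < 0.05
print("post-selection type I error:", false_rejections / reps)  # far above 0.05
```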

There is also a multiple-testing problem: if you do not correct your hypothesis tests (see the Bonferroni correction, for example), you end up incorrectly rejecting null hypotheses that are in fact true. https://www.stat.berkeley.edu/~mgoldman/Section0402.pdf
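A small sketch of the family-wise error rate with and without Bonferroni, on data where every null hypothesis is true (the number of tests and repetitions below are arbitrary): with $m$ tests at level $0.05$ the chance of at least one false rejection is roughly $1-(1-0.05)^m$, while testing each at $0.05/m$ brings it back near $0.05$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, n, reps, alpha = 20, 30, 2000, 0.05
any_uncorrected = any_bonferroni = 0
for _ in range(reps):
    data = rng.standard_normal((m, n))            # m samples, all true means are 0
    pvals = stats.ttest_1samp(data, 0.0, axis=1).pvalue
    any_uncorrected += (pvals < alpha).any()
    any_bonferroni += (pvals < alpha / m).any()   # Bonferroni: divide alpha by m
print("family-wise error, uncorrected:", any_uncorrected / reps)  # roughly 0.64
print("family-wise error, Bonferroni: ", any_bonferroni / reps)   # roughly 0.05
```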

A good way to do feature selection that exploits the lasso is stability selection: https://www.stat.cmu.edu/~ryantibs/journalclub/stability.pdf
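That reference is the stability selection paper by Meinshausen and Bühlmann. A rough sketch of the idea (my own, with arbitrary dimensions, penalty, and threshold): fit the lasso on many random half-samples and keep only the variables selected in a large fraction of them.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
n, p = 100, 50
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = 2.0                                   # 3 truly relevant predictors
y = X @ beta + rng.standard_normal(n)

n_subsamples = 100
counts = np.zeros(p)
for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)   # half-sample without replacement
    coef = Lasso(alpha=0.2).fit(X[idx], y[idx]).coef_
    counts += coef != 0
selection_prob = counts / n_subsamples
print("variables selected in >= 80% of subsamples:",
      np.where(selection_prob >= 0.8)[0])
```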

Marco Fumagalli