
Maybe I did not formulate my question properly. I would like to know how the set of selected variables differs between regularization and backward variable selection based on significance values (p-values).

In short, I was wondering which approach performs better in general. I can imagine that some insignificant features may be kept during regularization.
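To make the comparison concrete, here is a minimal sketch of the two approaches on simulated data. It uses scikit-learn's `LassoCV` against a hand-rolled backward elimination on `statsmodels` p-values; the data-generating process and the 0.05 cutoff are arbitrary choices for illustration.

```python
# Contrast lasso selection with backward elimination on p-values.
# Assumptions: scikit-learn and statsmodels are installed; the DGP and
# the 0.05 cutoff are arbitrary illustrative choices.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.5, -2.0, 1.0]          # only the first three features matter
y = X @ beta + rng.normal(size=n)

# Regularization: LassoCV picks the penalty by cross-validation and
# zeroes out some coefficients, doing selection and shrinkage at once.
lasso = LassoCV(cv=5).fit(X, y)
print("lasso keeps:   ", np.flatnonzero(lasso.coef_))

# Backward elimination: repeatedly drop the least significant feature
# until every remaining p-value is below the cutoff.
def backward_select(X, y, alpha=0.05):
    cols = list(range(X.shape[1]))
    while cols:
        pvals = sm.OLS(y, sm.add_constant(X[:, cols])).fit().pvalues[1:]
        worst = int(np.argmax(pvals))
        if pvals[worst] < alpha:
            break
        cols.pop(worst)
    return cols

print("backward keeps:", backward_select(X, y))
```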

Thanks

Snowflake
  • possible duplicate of [Why does the Lasso provide Variable Selection?](http://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection), also see http://stats.stackexchange.com/questions/86973/why-does-not-ridge-regression-perform-feature-selection-although-it-makes-use-of?rq=1 – rapaio Feb 05 '15 at 11:34
  • Dear Rapaio, I am sorry for stating my question incorrectly. I have reformulated my question. – Snowflake Feb 05 '15 at 11:57
  • I asked this before: http://stats.stackexchange.com/questions/127444/a-guide-to-regularization. I read the relevant chapter in those books and they explain the techniques really well but they don't really give much of an indication as to which is better. Then there is this guy: http://scott.fortmann-roe.com/docs/MeasuringError.html who prefers CV but I have seen the opposite view taken as well... – Dan Feb 05 '15 at 11:59
  • One difference is that regularization shrinks the parameter values towards zero while variable selection does not do that. Thus regularization is intentionally introducing bias while cutting the variance. – Richard Hardy Feb 05 '15 at 12:00
  • @RichardHardy but by dropping variables as in selection you are intentionally introducing bias too (and I'm pretty sure also cutting variance) – Dan Feb 05 '15 at 12:02
  • One other thing to keep in mind is that the forward/backward/stepwise selection methods are greedy algorithms, whereas regularization is not. On the other hand, regularization is parametric, whereas selection coupled with something like AIC is non-parametric. I don't know how this affects which scheme performs better in general though... – Dan Feb 05 '15 at 12:06
  • @Dan, good remark. What I had in mind is the following situation: suppose $y$ is generated as $y=x_1+\varepsilon$, but you don't know that. What you have is data on $y$, $x_1$ and $x_2$, and you don't know whether $x_1$ or $x_2$, or neither, or both are relevant. Then with variable selection you will hopefully select $x_1$ and estimate the original regression without bias. Meanwhile, you will not be able to recover the original regression and estimate it without bias using, e.g., ridge regression, because $x_2$ will be kept in and the coefficient on $x_1$ will be shrunk towards zero. (A simulation sketch of this scenario follows these comments.) – Richard Hardy Feb 05 '15 at 12:18
  • @Dan, but perhaps this example is not very relevant in practice. I guess it depends on the context. In cases where many regressors in the original data-generating process of $y$ *all* are somewhat important, both variable selection and regularization will introduce bias. – Richard Hardy Feb 05 '15 at 12:19
  • Remember the big picture: variable selection without shrinkage will result in many features being selected just because their importance is over-estimated. So ordinary variable selection based on $P$-values, AIC, etc. is usually very dangerous. (The second sketch after these comments illustrates this.) – Frank Harrell Feb 05 '15 at 13:24
  • "In general" is the trickiest part of this question. I don't think any method out-performs all others in *all* situations & by *all* definitions of performance. (Tibshirani's [original LASSO paper](http://statweb.stanford.edu/~tibs/lasso/lasso.pdf) gives one example where all-subsets out-performs LASSO.) So perhaps it merits an answer. – Scortchi - Reinstate Monica Feb 05 '15 at 16:54
  • @RichardHardy: Not sure what sense it makes to talk about lack of bias *conditional* on selecting the right predictor. But there's something to the idea that variable selection without shrinkage has low bias when predictors' effects are large *or* zero & the signal-to-noise ratio is high enough that the right ones are reliably picked. – Scortchi - Reinstate Monica Feb 06 '15 at 21:59
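To illustrate @RichardHardy's two-regressor scenario, here is a small simulation assuming his data-generating process $y = x_1 + \varepsilon$ with an irrelevant $x_2$; the ridge penalty and the sample size are arbitrary illustrative choices. It compares OLS after (correctly) selecting $x_1$ with ridge regression on both regressors.

```python
# Simulation of the scenario above: y = x1 + noise, x2 irrelevant.
# Assumptions: the ridge penalty alpha=10 and the sample size are
# arbitrary illustrative choices, not recommendations.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
reps, n = 2000, 50
ols_x1, ridge_x1 = [], []
for _ in range(reps):
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)
    y = x1 + rng.normal(size=n)          # true coefficient on x1 is 1
    X = np.column_stack([x1, x2])
    # Suppose variable selection correctly drops x2: OLS on x1 alone
    # is then an unbiased estimate of the true coefficient.
    ols_x1.append(LinearRegression().fit(x1.reshape(-1, 1), y).coef_[0])
    # Ridge keeps both regressors and shrinks the coefficient on x1.
    ridge_x1.append(Ridge(alpha=10.0).fit(X, y).coef_[0])

print("mean OLS coefficient on x1:  ", np.mean(ols_x1))    # close to 1
print("mean ridge coefficient on x1:", np.mean(ridge_x1))  # shrunk below 1
```

The OLS coefficient averages close to the true value of 1, while the ridge coefficient is visibly shrunk towards zero: the bias Richard Hardy describes, bought in exchange for lower variance.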
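And here is a sketch of the danger @FrankHarrell points out: when candidates that are pure noise are screened on $p < 0.05$, the features that survive have coefficient estimates far from their true value of zero, i.e. their importance is over-estimated. The sample sizes and the cutoff are again arbitrary choices.

```python
# Sketch of the "over-estimated importance" danger: with pure-noise
# candidates, features surviving a p < 0.05 screen have coefficient
# estimates far from their true value of zero. Settings are arbitrary.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p, reps = 100, 20, 500
kept_coefs = []
for _ in range(reps):
    X = rng.normal(size=(n, p))
    y = rng.normal(size=n)               # no feature is truly relevant
    fit = sm.OLS(y, sm.add_constant(X)).fit()
    sig = fit.pvalues[1:] < 0.05         # naive significance screen
    kept_coefs.extend(np.abs(fit.params[1:][sig]))

print("features kept per fit:", len(kept_coefs) / reps)      # ~1 (5% of 20)
print("mean |coef| of kept features:", np.mean(kept_coefs))  # far from 0
```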
