
Assume I have 10 features to predict an outcome and I use LASSO regression. Let's say the RMSE on the test set is 20.
Now I introduce 5 more features to predict the same outcome, again with LASSO, and the RMSE on the test set gets much larger, 80.

How is this possible? Why would it happen?

LASSO shrinks parameters to 0. If the new features cannot improve predictive ability, why would it not shrink their coefficients to 0, so that the new model collapses back to the original one? In that case, shouldn't it produce a similar test-set RMSE, not a drastically different one?
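For concreteness, the comparison I have in mind looks roughly like this (a minimal sketch; `X10`, `X15`, `y`, and the `train`/`test` indices are placeholders for my actual data):

```r
library(glmnet)

# X10: n x 10 feature matrix; X15: the same features plus 5 new columns.
# train / test: index vectors for the split. All of these are placeholders.
fit10 <- glmnet(X10[train, ], y[train], alpha = 1)  # LASSO path, 10 features
fit15 <- glmnet(X15[train, ], y[train], alpha = 1)  # LASSO path, 15 features

rmse <- function(fit, X, s) {
  pred <- predict(fit, newx = X[test, ], s = s)
  sqrt(mean((y[test] - pred)^2))
}
rmse(fit10, X10, s = 0.1)  # test RMSE around 20 in my case
rmse(fit15, X15, s = 0.1)  # test RMSE around 80 with the extra features
```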

Lazar
  • You are standardizing all your predictors, right? – TBSRounder May 17 '16 at 13:12
  • No, I didn't. The predictors are counts, all non-negative values. – Lazar May 17 '16 at 14:26
  • I would give it a shot and see if it helps. If some of the predictors tend to have higher counts than others, it could mess with your coefficient shrinking, even though your predictors' "units" are all the same. The regularization will penalize your predictors with larger coefficients more, and by standardizing you are making sure they are all penalized on equal terms. Also, how are you tuning/selecting your shrinkage parameter lambda? Could be an overfitting issue. – TBSRounder May 17 '16 at 14:33
  • But in the R package glmnet, standardize = TRUE is the default, so that means I always standardize the predictors. – Lazar May 17 '16 at 15:00
  • OK, good. What about the tuning? I would expect your RMSE to go down in your training set after adding the variables, but not necessarily in your test set. – TBSRounder May 17 '16 at 15:39
  • It might help if you described the exact procedure you used to train the models (it sounds like you're using glmnet, so the relevant code would be helpful). The situation you describe sounds like you're overfitting to the training set. As TBSRounder pointed out, cross-validation is a good approach to avoid overfitting (a sketch follows these comments). – josliber May 18 '16 at 19:07
  • **1.** If I am not mistaken, LASSO tends to perform poorly when there is a high degree of multicollinearity between the regressors; ridge regression is more suitable then. **2.** Small samples may also prevent utilizing the full potential of LASSO. While asymptotically some features could be set to zero, in a finite sample anything can happen. See also [When wouldn't I use LASSO for model selection?](http://stats.stackexchange.com/questions/77834/when-wouldnt-i-use-lasso-for-model-selection). – Richard Hardy May 20 '16 at 18:05
  • @RichardHardy Not necessarily true. When it comes to the "data reduction" of penalized regression, LASSO will tend to pick a feature that tends to be representative of many collinear variables, whereas ridge will give an efficient representation of their joint effect using a linear combination of all their effects. It may be compared to selecting a feature with high loadings on the first principal component (LASSO) or generating the first orthonormal basis (ridge) in unsupervised learning. – AdamO May 23 '16 at 18:50
  • @AdamO, thank you for the clarification. Would you have anything to add to the linked question in my previous comment? – Richard Hardy May 23 '16 at 19:30
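Following up on the tuning discussion in the comments above, here is a minimal sketch of selecting $\lambda$ by cross-validation with glmnet (`X` and `y` are placeholder names for the feature matrix and outcome):

```r
library(glmnet)

# standardize = TRUE is the default, so the count predictors are put on a
# common scale before the penalty is applied.
cvfit <- cv.glmnet(X, y, alpha = 1, nfolds = 10, standardize = TRUE)

cvfit$lambda.min                # lambda minimizing cross-validated MSE
cvfit$lambda.1se                # largest lambda within 1 SE of that minimum
coef(cvfit, s = "lambda.1se")   # inspect which coefficients were set to zero
```

Predicting the test set with `s = "lambda.min"`, rather than an arbitrary $\lambda$, guards against the overfitting scenario raised in the comments.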

2 Answers


It's true that LASSO encourages sparseness, in the sense that $\beta$s which are very close to zero are set exactly to zero. But if the tendency is to retain features that have large values of $\beta$, this need not lead to better prediction in the RMSE sense.

A $\beta$ for a newly introduced feature may be very large because that feature has low prevalence or low variability: it enhances prediction in a small group of observations that is very different from the rest, at the price of losing predictive accuracy among the masses, who were discriminated better by the smaller $\beta$s.

As an example, here is a trivariate relationship between a continuous feature $x$, a binary feature $w$, and an outcome $Y$. The $x$ effect explains much more of the variability in these data than the $w$ effect, despite the $x$ effect being smaller overall than the $w$ effect, even after standardization. LASSO would favor $w$ over $x$ in a model because of its magnitude, but magnitude alone does not suffice for good prediction; we merely select $w$ because it is good at discriminating participants. This is the type of feature LASSO tends to select.

[Figure: simulated trivariate relationship between the continuous feature $x$, the binary feature $w$, and the outcome $Y$]

This underscores the importance of using cross-validation to select the tuning parameter in a LASSO model. In a case such as this, you would find much better predictive accuracy by including both effects.
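A small simulation in the spirit of this example (this is not the code behind the figure; the coefficients and the prevalence of $w$ below are made-up illustrative values):

```r
library(glmnet)
set.seed(1)

n <- 500
x <- rnorm(n)                   # continuous feature
w <- rbinom(n, 1, 0.05)         # low-prevalence binary feature
y <- 1 * x + 8 * w + rnorm(n)   # w has the much larger coefficient

X   <- cbind(x = x, w = w)
fit <- glmnet(X, y, alpha = 1)
plot(fit, xvar = "lambda", label = TRUE)
# At heavy penalization only w survives: it discriminates the small w = 1
# group well but leaves most of the variability in y unexplained.

cvfit <- cv.glmnet(X, y, alpha = 1)
coef(cvfit, s = "lambda.min")   # cross-validation retains both effects
```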

AdamO
  • Perhaps your answer could also be relevant to a popular and useful thread [When wouldn't I use LASSO for model selection?](http://stats.stackexchange.com/questions/77834/when-wouldnt-i-use-lasso-for-model-selection). – Richard Hardy May 20 '16 at 18:06

The straightforward answer is that LASSO does not always work as intended. Given its penalty factor, LASSO has no way of differentiating between a true causal variable with a high coefficient that should be selected into your model and another variable that has little relationship with $Y$ and a low coefficient. The LASSO algorithm may, essentially at random, often select the weak variable instead of the strong causal one. And that is a big problem: in doing so, the LASSO model not only dismantles the explanatory logic of your original model, it also typically produces much poorer predictions than the original model.

You can visualize representations of such problems by doing image searches for "LASSO coefficient path" and "LASSO MSE graph". The first set of graphs shows how often LASSO chooses weak variables instead of strong causal ones. The second (MSE) shows how numerous LASSO models do a poor job at prediction, with the best model being the original one, associated with a penalty factor of zero.
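You can also draw both plots for your own data; a minimal sketch with glmnet, where `X` and `y` stand in for your feature matrix and outcome:

```r
library(glmnet)

fit <- glmnet(X, y, alpha = 1)
plot(fit, xvar = "lambda", label = TRUE)  # coefficient paths: order of entry

cvfit <- cv.glmnet(X, y, alpha = 1)
plot(cvfit)  # CV MSE vs. log(lambda); lambda.min and lambda.1se are marked
```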

Sympa