
Since I'm relatively new to regularized regression, I'm concerned about the huge differences in the results that lasso, ridge, and elastic net deliver.

My data set has the following characteristics:

  • panel data set: > 900,000 obs. and over 50 variables
  • highly unbalanced
  • 2-5 variables are highly correlated

To select only a subset of the variables, I used penalized logistic regression, fitting the model: $\frac{1}{N} \sum_{i=1}^{N}L(\beta,X,y)-\lambda[(1-\alpha)||\beta||^2_2/2+\alpha||\beta||_1]$
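For anyone wanting to reproduce the comparison: the workflow above is glmnet's, but the same three fits can be sketched in Python with scikit-learn (note that sklearn parameterizes the penalty strength as $C = 1/\lambda$, and calls the mixing weight `l1_ratio` rather than $\alpha$). The data below are simulated, and the penalty strength `C=0.05` is an arbitrary illustrative choice, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
# make a few columns highly correlated, as in the question
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)
X[:, 2] = X[:, 0] + 0.05 * rng.standard_normal(n)
y = (X[:, 0] + 0.5 * X[:, 3] + rng.standard_normal(n) > 0).astype(int)

fits = {
    "lasso":       LogisticRegression(penalty="l1", solver="saga",
                                      C=0.05, max_iter=5000),
    "ridge":       LogisticRegression(penalty="l2", solver="saga",
                                      C=0.05, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.05, max_iter=5000),
}
for name, m in fits.items():
    m.fit(X, y)
    nonzero = int(np.sum(np.abs(m.coef_) > 1e-8))
    print(f"{name}: {nonzero} nonzero coefficients")
```

The pattern in the question shows up immediately: the L1 fit zeroes out most coefficients, while the L2 fit keeps all of them small but nonzero.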

To determine the optimal $\lambda$ I used cross-validation, which yields the following results:

[cross-validation curves for the lasso and ridge fits]

The elastic net looks quite similar to the lasso, also proposing only 2 variables.

So my main question is: why do these approaches deliver such different results? According to the lasso, I have only 2 variables in the final model, whereas according to the ridge I have 34.

So in the end, which approach is the right one? And why are the results so extremely different?

Thanks a lot!

Jogi

1 Answer


By mean squared error do you mean the Brier score? And for the elastic net the plot should be 3-dimensional, since there are two simultaneous penalty parameters. Don't force $\alpha$ to be 0 or 1.
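To illustrate the point about two simultaneous penalty parameters: rather than fixing $\alpha$, cross-validate over a grid of both $\lambda$ and $\alpha$ jointly. A minimal scikit-learn sketch on simulated data (there, `Cs` indexes the $1/\lambda$ grid and `l1_ratios` the $\alpha$ grid; grid sizes here are arbitrary small choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
y = (X[:, 0] - X[:, 1] + rng.standard_normal(500) > 0).astype(int)

cv = LogisticRegressionCV(
    Cs=5,                                 # grid over 1/lambda
    penalty="elasticnet", solver="saga",
    l1_ratios=[0.0, 0.25, 0.5, 0.75, 1.0],  # grid over alpha
    cv=3, max_iter=2000,
).fit(X, y)
print("chosen C:", cv.C_[0], "chosen l1_ratio:", cv.l1_ratio_[0])
```

The selected mixing weight then reflects what the data support, rather than an a priori choice of pure lasso ($\alpha=1$), pure ridge ($\alpha=0$), or the 50/50 compromise.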

To answer your question: the lasso is spending information trying to be parsimonious, while a quadratic penalty is not trying to select features but just to predict accurately. It is a fool's errand to expect that a typical problem will result in a parsimonious model that is highly discriminating. In addition, the lasso is not stable, i.e., if you were to repeat the experiment the list of selected features would vary quite a lot.
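The instability is easy to see by refitting the lasso on bootstrap resamples: with highly correlated predictors, which member of a correlated pair gets selected can flip from one resample to the next. A small simulated sketch (the data, penalty strength, and number of resamples are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
x0 = rng.standard_normal(n)
X = np.column_stack([
    x0,
    x0 + 0.1 * rng.standard_normal(n),   # near-duplicate of feature 0
    rng.standard_normal(n),              # pure noise feature
])
y = (x0 + rng.standard_normal(n) > 0).astype(int)

selected = set()
for _ in range(20):
    idx = rng.integers(0, n, n)          # bootstrap resample
    m = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    m.fit(X[idx], y[idx])
    # record which feature indices survived the L1 penalty this time
    selected.add(tuple(np.flatnonzero(np.abs(m.coef_[0]) > 1e-8)))
print("distinct selected-feature sets across resamples:", len(selected))
```

If the selection were stable, every resample would return the same set; in practice, correlated predictors trade places across resamples.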

For optimum prediction use ridge logistic regression. Elastic net is a nice compromise between that and lasso.

Frank Harrell
  • @Frank Harrell: Thank you so much for the quick reply! The mean squared error comes from the cv.glmnet() function using the specification type.measure = "mse" - I think it's the Brier score. For the elastic net I chose $\alpha=0.5$. In terms of the AUC on the development set, the lasso model achieved 0.863, whereas the ridge scored 0.854. On the validation sample the lasso achieves 0.887, whereas the ridge scored 0.880 - which does not support the ridge. Might this be due to the fact that I have 1 variable with a fantastic bivariate AUC of >0.9? - So the lasso incorporates only 1 other... – Jogi Aug 12 '17 at 12:50
  • ... such that I intuitively would say that the ridge - incorporating more than 33 other variables - might dilute the predictive power of that single very powerful variable? – Jogi Aug 12 '17 at 12:52
    You're right, if there is one smoking gun predictor it can be penalized too much with lasso, elastic net, or ridge. These methods are essentially using a Bayesian prior distribution with equal belief in the effects of all variables pre-analysis. Don't judge too much by $c$-index (AUROC) which isn't as sensitive as things based on the log likelihood such as pseudo $R^2$ and likelihood ratio $\chi^2$ statistic. – Frank Harrell Aug 12 '17 at 17:53
  • Thanks for the 'thanks' but this site uses upvoting for that. – Frank Harrell Aug 13 '17 at 12:56