
Since I'm relatively new to regularized regression, I'm concerned about the huge differences in the results that lasso, ridge, and elastic net deliver.

My data set has the following characteristics:

  • panel data set: > 900,000 obs. and over 50 variables
  • highly unbalanced
  • 2-5 variables are highly correlated

To select only a subset of the variables, I used penalized logistic regression, fitting the model: $\frac{1}{N} \sum_{i=1}^{N}L(\beta,X,y)-\lambda[(1-\alpha)||\beta||^2_2/2+\alpha||\beta||_1]$
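For anyone wanting to reproduce the comparison: the workflow above is glmnet's, but the same three fits can be sketched in Python with scikit-learn (note that sklearn parameterizes the penalty strength as $C = 1/\lambda$, and calls the mixing weight `l1_ratio` rather than $\alpha$). The data below are simulated, and the penalty strength `C=0.05` is an arbitrary illustrative choice, not a tuned value:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
# make a few columns highly correlated, as in the question
X[:, 1] = X[:, 0] + 0.05 * rng.standard_normal(n)
X[:, 2] = X[:, 0] + 0.05 * rng.standard_normal(n)
y = (X[:, 0] + 0.5 * X[:, 3] + rng.standard_normal(n) > 0).astype(int)

fits = {
    "lasso":       LogisticRegression(penalty="l1", solver="saga",
                                      C=0.05, max_iter=5000),
    "ridge":       LogisticRegression(penalty="l2", solver="saga",
                                      C=0.05, max_iter=5000),
    "elastic net": LogisticRegression(penalty="elasticnet", solver="saga",
                                      l1_ratio=0.5, C=0.05, max_iter=5000),
}
for name, m in fits.items():
    m.fit(X, y)
    nonzero = int(np.sum(np.abs(m.coef_) > 1e-8))
    print(f"{name}: {nonzero} nonzero coefficients")
```

The pattern in the question shows up immediately: the L1 fit zeroes out most coefficients, while the L2 fit keeps all of them small but nonzero.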

To determine the optimal $\lambda$ I used cross-validation, which yields the following results:

[cross-validation curves for the lasso and ridge fits]

The elastic net looks quite similar to the lasso, also proposing only 2 variables.

So my main question is: why do these approaches deliver such different results? According to the lasso, I have only 2 variables in the final model, whereas according to the ridge I have 34.

So in the end, which approach is the right one? And why are the results so extremely different?

Thanks a lot!

Jogi

1 Answer


By mean squared error do you mean the Brier score? And for the elastic net the plot should be 3-dimensional, since there are two simultaneous penalty parameters. Don't force $\alpha$ to be 0 or 1.
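To illustrate the point about two simultaneous penalty parameters: rather than fixing $\alpha$, cross-validate over a grid of both $\lambda$ and $\alpha$ jointly. A minimal scikit-learn sketch on simulated data (there, `Cs` indexes the $1/\lambda$ grid and `l1_ratios` the $\alpha$ grid; grid sizes here are arbitrary small choices for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 10))
y = (X[:, 0] - X[:, 1] + rng.standard_normal(500) > 0).astype(int)

cv = LogisticRegressionCV(
    Cs=5,                                 # grid over 1/lambda
    penalty="elasticnet", solver="saga",
    l1_ratios=[0.0, 0.25, 0.5, 0.75, 1.0],  # grid over alpha
    cv=3, max_iter=2000,
).fit(X, y)
print("chosen C:", cv.C_[0], "chosen l1_ratio:", cv.l1_ratio_[0])
```

The selected mixing weight then reflects what the data support, rather than an a priori choice of pure lasso ($\alpha=1$), pure ridge ($\alpha=0$), or the 50/50 compromise.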

To answer your question: the lasso is spending information trying to be parsimonious, while a quadratic penalty is not trying to select features but just to predict accurately. It is a fool's errand to expect that a typical problem will result in a parsimonious model that is highly discriminating. In addition, the lasso is not stable, i.e., if you were to repeat the experiment the list of selected features would vary quite a lot.
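The instability is easy to see by refitting the lasso on bootstrap resamples: with highly correlated predictors, which member of a correlated pair gets selected can flip from one resample to the next. A small simulated sketch (the data, penalty strength, and number of resamples are all illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
x0 = rng.standard_normal(n)
X = np.column_stack([
    x0,
    x0 + 0.1 * rng.standard_normal(n),   # near-duplicate of feature 0
    rng.standard_normal(n),              # pure noise feature
])
y = (x0 + rng.standard_normal(n) > 0).astype(int)

selected = set()
for _ in range(20):
    idx = rng.integers(0, n, n)          # bootstrap resample
    m = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
    m.fit(X[idx], y[idx])
    # record which feature indices survived the L1 penalty this time
    selected.add(tuple(np.flatnonzero(np.abs(m.coef_[0]) > 1e-8)))
print("distinct selected-feature sets across resamples:", len(selected))
```

If the selection were stable, every resample would return the same set; in practice, correlated predictors trade places across resamples.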

For optimum prediction use ridge logistic regression. Elastic net is a nice compromise between that and lasso.

Frank Harrell
  • @Frank Harrell: Thank you so much for the quick reply! The mean squared error comes from the cv.glmnet() function using the specification type.measure = "mse" - I think it's the Brier score. For the elastic net I chose $\alpha=0.5$. In terms of the AUC on the development set, the lasso model achieved 0.863, whereas the ridge scored 0.854. On the validation sample the lasso achieves 0.887, whereas the ridge scored 0.880 - which does not support the ridge. Might this be due to the fact that I have 1 variable with a fantastic bivariate AUC of >0.9? - So the lasso incorporates only 1 other... – Jogi Aug 12 '17 at 12:50
  • ... such that I intuitively would say that the ridge - incorporating more than 33 other variables - might dilute the predictive power of that single very powerful variable? – Jogi Aug 12 '17 at 12:52
    You're right, if there is one smoking gun predictor it can be penalized too much with lasso, elastic net, or ridge. These methods are essentially using a Bayesian prior distribution with equal belief in the effects of all variables pre-analysis. Don't judge too much by $c$-index (AUROC) which isn't as sensitive as things based on the log likelihood such as pseudo $R^2$ and likelihood ratio $\chi^2$ statistic. – Frank Harrell Aug 12 '17 at 17:53
  • Thanks for the 'thanks' but this site uses upvoting for that. – Frank Harrell Aug 13 '17 at 12:56