
I am running a regression model with both Lasso and Ridge (to predict a discrete outcome variable ranging from 0 to 5). Before running the model, I use the SelectKBest method of scikit-learn to reduce the feature set from 250 to 25. Without an initial feature selection, both Lasso and Ridge yield lower accuracy scores (which might be due to the small sample size, 600). Also, note that some of the features are correlated.
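For reference, here is a minimal sketch of the pipeline with synthetic stand-in data (the score function, alpha values, and scaling step are placeholders rather than my exact settings):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso, Ridge

# Stand-in for my data: 600 samples, 250 correlated features, discrete outcome 0-5
X, y_cont = make_regression(n_samples=600, n_features=250, n_informative=30,
                            effective_rank=40, noise=10.0, random_state=0)
y = np.clip(np.round(5 * (y_cont - y_cont.min()) / (y_cont.max() - y_cont.min())), 0, 5)

# 1) univariate feature selection: 250 -> 25
selector = SelectKBest(score_func=f_regression, k=25)
X_sel = selector.fit_transform(X, y)

# 2) standardize so coefficient magnitudes are comparable across features
X_std = StandardScaler().fit_transform(X_sel)

# 3) fit both penalized models (alpha values are placeholders)
lasso = Lasso(alpha=0.1).fit(X_std, y)
ridge = Ridge(alpha=1.0).fit(X_std, y)

# 4) compare the top-10 features by absolute coefficient
top_lasso = set(np.argsort(-np.abs(lasso.coef_))[:10])
top_ridge = set(np.argsort(-np.abs(ridge.coef_))[:10])
print("top-10 overlap:", len(top_lasso & top_ridge))
```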

After running the model, I observe that the prediction accuracy is almost the same with Lasso and Ridge. However, when I check the first 10 features after ordering them by the absolute value of their coefficients, I see that there is at most 50% overlap.

That is, given that each method assigns different importance to the features, I might end up with a totally different interpretation based on the model I choose.

Normally, the features represent some aspects of user behavior on a website. Therefore, I want to explain the findings by highlighting the features (user behaviors) with stronger predictive ability versus the weaker ones. However, I do not know how to move forward at this point. How should I approach interpreting the model? For example, should I combine both and highlight the overlapping features, or should I go with Lasso since it provides more interpretability?

renakre
  • (+1) Regularization can be seen as making individual coefficient estimates worse while improving their collective performance at predicting new responses. What precisely are you trying to achieve with your interpretation? – Scortchi - Reinstate Monica Mar 14 '17 at 11:26
  • @Scortchi thanks for responding. I added this: `Normally, the features represent some aspects of user behavior in a web site. Therefore, I want to explain the findings by highlighting the features (user behaviors) with stronger predictive ability vs weaker features (user behaviors).` – renakre Mar 14 '17 at 11:40
  • +1 AFAIK the relation between ridge coefficients and lambda doesn't have to be monotonic, while in lasso it is. Thus, at certain shrinkage levels the absolute values of coefficients in ridge and lasso may vary a lot. Having said that, I would appreciate it if someone can sketch a proof of this or shortly explain it mathematically – Łukasz Grad Mar 14 '17 at 11:43
  • Make sure you are sorting the "beta" coefficients. See http://stats.stackexchange.com/a/243439/70282 You can get them by training on standardized variables or by adjustment later as described in the link. – Chris Mar 14 '17 at 12:57
  • I wonder if this applies https://en.wikipedia.org/wiki/Simpson's_paradox – Chris Mar 14 '17 at 13:03
  • @ŁukaszGrad LASSO coefficients need not be monotonic functions of $\lambda$ if predictors are correlated; see figure 6.6 of [ISLR](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20First%20Printing.pdf) for an example. – EdM Mar 18 '17 at 15:18
  • On a side note, you could try elastic net. – Richard Hardy Mar 24 '17 at 12:56
  • @RichardHardy thanks for your answer. Can you please give more information regarding how elastic net could help this? I appreciate it! – renakre Mar 24 '17 at 12:57
  • It is a middle way between lasso and ridge. It uses a weighted combination of $L_1$ and $L_2$ penalties. I am not sure it could help much with interpretation, but it could achieve higher forecast accuracy. Also, looking at what relative weight for $L_1$ vs $L_2$ optimizes performance, you could probably get some additional insight. – Richard Hardy Mar 24 '17 at 13:00
  • @renakre how are you measuring predictive performance? – bdeonovic Mar 24 '17 at 13:23
  • @bdeonovic what do you exactly mean? I am using Mean Absolute Error for measuring the performance, would this answer your question? Thanks for interest! – renakre Mar 24 '17 at 14:12
  • @renakre you can measure predictive performance by fitting model with data, and then using same data to measure how well your model fit. This will result in a biased estimate since you used the data to fit model AND same data to estimate performance. Better is to use $k$-fold cross-validation (hey look thats what this place is called!) where you split data into $k$ parts, fit on the first $k-1$ parts and test performance on last part, keep doing that till you have tested on each of the $k$ parts and average over that. – bdeonovic Mar 24 '17 at 14:42
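A minimal sketch of the $k$-fold cross-validation suggested in the last comment, scored with mean absolute error (the estimator, its alpha, and the fold count are placeholder choices, not the poster's setup):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; Lasso and its alpha are placeholder choices
X, y = make_regression(n_samples=600, n_features=25, noise=10.0, random_state=0)

# 5-fold cross-validated mean absolute error (scikit-learn reports it negated)
scores = cross_val_score(Lasso(alpha=0.1), X, y, cv=5,
                         scoring="neg_mean_absolute_error")
print("cross-validated MAE:", -scores.mean())
```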

1 Answer


Ridge regression encourages all coefficients to become small. Lasso encourages many/most[**] coefficients to become exactly zero, leaving only a few non-zero. Both of them will reduce accuracy on the training set, but improve prediction in some way:

  • ridge regression attempts to improve generalization to the test set, by reducing overfitting
  • lasso will reduce the number of non-zero coefficients, even if this penalizes performance on both the training and test sets

You can get different choices of coefficients if your features are highly correlated. For example, you might have 5 features that are correlated:

  • by assigning small but non-zero coefficients to all of these features, ridge regression can achieve a low loss on the training set, which might plausibly generalize to the test set
  • lasso might choose[*] only a single one of these, one that correlates well with the other four, and there's no reason why it should pick the feature with the highest coefficient in the ridge regression version (see the sketch below the footnotes)

[*] here 'choose' means: assigns a non-zero coefficient. This is still a bit hand-wavy, since ridge regression coefficients will tend to all be non-zero, but e.g. some might be around 1e-8 while others might be around 0.01

[**] nuance: as Richard Hardy points out, for some use cases a value of $\lambda$ can be chosen which will result in all LASSO coefficients being non-zero, but with some shrinkage applied
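To make the correlated-features point concrete, here is a minimal sketch with made-up data: five noisy copies of one latent signal (the alpha values are arbitrary, just enough to apply visible shrinkage):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 600

# one latent signal, five noisy copies of it (highly correlated features)
latent = rng.normal(size=n)
X = np.column_stack([latent + 0.1 * rng.normal(size=n) for _ in range(5)])
y = latent + 0.5 * rng.normal(size=n)

# alpha values are arbitrary placeholder choices
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso coefficients:", np.round(lasso.coef_, 3))  # typically one or two non-zero
print("ridge coefficients:", np.round(ridge.coef_, 3))  # small, spread across all five
```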

Hugh Perkins
  • Good suggestions. A good check would be to do a correlation matrix. The non-overlapping variables may be highly correlated. – Chris Mar 14 '17 at 12:59
  • Good answer! However, I'm not sure it's fair to suggest that ridge universally attempts to improve test performance while not saying the same for lasso. For instance, if the true model is sparse (and in the subset of our predictors), we can immediately expect lasso to have better test performance than ridge – user795305 Mar 14 '17 at 22:47
  • This is the 'bet on sparsity' principle. For instance, see the first plot here: http://faculty.bscb.cornell.edu/~bien/simulator_vignettes/lasso.html – user795305 Mar 14 '17 at 22:51
  • Comparisons of variable choices (LASSO) and regression coefficients among multiple bootstrap samples of the data can nicely illustrate these issues. With correlated predictors, those chosen by LASSO from different bootstraps can be quite different while still providing similar predictive performance. Ideally, the entire model-building process including the initial feature-set reduction should be repeated on multiple bootstraps to document the quality of the process. – EdM Mar 18 '17 at 15:15
  • *by choosing 4 of these features, with lowish coefficients, or even all of them, again with small, but non-zero, coefficients, ridge regression can low loss on training set* -- ridge regression does not choose variables. Also, for low values of $\lambda$, lasso will choose *all* variables but do some shrinkage, just like ridge. – Richard Hardy Mar 24 '17 at 12:43
  • @RichardHardy fair point. edited a bit to address 'ridge regression does not choose variables' – Hugh Perkins Mar 24 '17 at 13:16
  • Better! Now regarding *Lasso encourages all coefficients to become zero, and a few non-zero.* I suggest *all* $\rightarrow$ *some*. Also, the first two bullet points might suggest a false dichotomy: *both* lasso and ridge attempt to achieve better performance on the test set... – Richard Hardy Mar 24 '17 at 13:24
  • for the second assertion, nuance: for some use-cases, we might encourage sparsity to the extent that this actually reduces performance on the test set, but what we get in return is: sparsity. For example, Ribeiro's "Why should I trust you?" uses LARS to induce sparsity, in order to provide a compact, easy to read interpretable result to the user https://arxiv.org/abs/1602.04938 – Hugh Perkins Mar 24 '17 at 13:28
  • @RichardHardy For the first observation, ok fair enough. I was briefly tempted to argue that technically L1 regularization does itself push all coefficients to zero, and it's the MSE loss that pushes away from zero, but such an argument is also true for L2, so I have simply edited the answer, according to your suggestion :-) – Hugh Perkins Mar 24 '17 at 13:32
  • Regarding over-sparsity: OK, if the objective function is not optimal predictive performance but something else, then you can do funny things with both lasso and ridge. Regarding "first observation", I reiterate that there are values of $\lambda$ where *all* of lasso coefficients are nonzero, and that may even make sense in applications (not all data generating processes are sparse). So *all* vs *many/most* vs *some* is an important qualifier. – Richard Hardy Mar 24 '17 at 13:39
  • Fair enough: updated to include nuance on, for some use-cases $\lambda$ will be chosen that results in all coefficients being non-zero, but with some shrinkage applied. – Hugh Perkins Mar 24 '17 at 13:49