(Question edited, since the original was flagged as a duplicate.) I used Matlab to build a lasso model for my data, which has 41 predictors and one response variable. Perhaps I used more variables than I need, or some variables are not meaningful, since some of the regression coefficients are 0.

For the non-zero coefficients, I got values such as 0.2, 0.8, 2.7, etc.

I understand that the higher the regression coefficient, the more important the respective variable. Is there a metric or rule of thumb that says that if a regression coefficient is 10/50/100 times lower than the highest coefficient, we can "reject" or "not consider" that variable when implementing the model online? Or, since lasso gave it a non-zero value, should I stick with that variable no matter what?
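
Roughly, the model was fit along these lines (a sketch rather than my exact code; `X` holds the 41 predictors and `y` the response):

```matlab
% Sketch of the fit described above (not the exact code).
[B, FitInfo] = lasso(X, y, 'CV', 10);   % 10-fold cross-validation over a lambda path
coefs = B(:, FitInfo.Index1SE);         % coefficients at lambda within 1 SE of the CV minimum
nonzero = find(coefs ~= 0);             % predictors the lasso kept
disp([nonzero, coefs(nonzero)])         % non-zero values, e.g. 0.2, 0.8, 2.7
```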

Tiago Dias
  • This issue of feature importance is tricky and is discussed extensively on this site. [This page](https://stats.stackexchange.com/q/202221/28500) has extensive discussion and links to further discussion, as does [this page](https://stats.stackexchange.com/q/202277/28500). Please look over those pages and their links and edit your question to specify any specific statistical issues that are still unclear to you. Also, please edit your question to spell out the meanings of "RR" and "EN", as not all readers will know what you mean by them. – EdM Apr 16 '19 at 14:59
  • I know of no such rule. – Peter Flom Apr 17 '19 at 11:31
  • @PeterFlom So if it gave a non-zero coefficient, I should stick with that variable, no matter the value? – Tiago Dias Apr 17 '19 at 13:37
  • It's not good, in statistics, to make universal rules for model building. You have to *think* about what you are doing. But you shouldn't reject a variable just because it is non-significant or has a low value. Model building requires thought. – Peter Flom Apr 18 '19 at 11:53

2 Answers

You have to think carefully about "importance" of selected predictors and what "p-values" really mean in LASSO.

Predictor importance

Demonstrations of LASSO are often based on a simulated data set with a small number of predictors associated with the outcome and a large number that are not. In that setting LASSO works well at finding the truly important predictors.
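
For instance, a toy version of such a demonstration in Matlab (sizes and coefficients made up for illustration) might look like:

```matlab
% Toy demonstration: 5 truly informative predictors among 41 (made-up sizes).
rng(1);                                  % reproducibility
n = 200; p = 41;
X = randn(n, p);
beta = zeros(p, 1);
beta(1:5) = [3; -2; 2; 1.5; -1];         % only the first 5 predictors matter
y = X * beta + randn(n, 1);

[B, FitInfo] = lasso(X, y, 'CV', 10);
coefs = B(:, FitInfo.Index1SE);
find(coefs ~= 0)'                        % typically recovers roughly predictors 1 through 5
```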

But in real-world applications, with multiple predictors that are correlated with each other, the choice of "important" predictors will vary from sample to sample drawn from the same population. The variability in "importance" among predictors that you saw across re-samples of your data, which forms the basis of the stability-selection method recommended in the answer by @Edgar, should raise some questions about what "importance" of individual predictors means when multiple correlated predictors are related to the outcome.

Even when LASSO returns a value of 0 for a predictor's coefficient (as it is designed to do), that doesn't mean it's "not meaningful"; it just means that it didn't add enough to the model to matter for your particular sample and sample size. The predictors that were selected might be important within your particular data sample, but that doesn't mean they are the most important in any fundamental sense in the overall population and they certainly can't be interpreted to have causal effects on outcome.

Your particular approach, based on ranking coefficient values, is potentially dangerous, depending on how it is done. Predictors are typically standardized before LASSO so that differences in measurement scales don't differentially affect the penalization of the coefficients. But some software then re-scales the coefficients back to the original measurement scales. So at the least you have to be careful about whether you are ranking coefficients for standardized or for re-scaled predictors. You don't want the importance of a predictor representing a length to depend on whether you measured it in millimeters or miles.
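
To make the scale issue concrete, here is a small sketch (toy data; Matlab's `lasso` standardizes internally by default but returns coefficients on the original scale):

```matlab
% The same length measured in millimeters vs. miles (illustrative only).
rng(2);
n = 100;
len_mm = 1000 * rand(n, 1);                  % a length in millimeters
other  = randn(n, 1);
y = 0.002 * len_mm + other + 0.1 * randn(n, 1);

B_mm = lasso([len_mm,           other], y, 'Lambda', 0.05);  % length in mm
B_mi = lasso([len_mm / 1.609e6, other], y, 'Lambda', 0.05);  % same length in miles
disp([B_mm, B_mi])  % the length coefficient changes by orders of magnitude; its rank flips
```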

LASSO p-values

In many applications, the most important issue with LASSO is how well the model works for prediction. A strength of LASSO is that, even with its potentially unstable selection among correlated predictors, models can work quite well in practice for prediction. In that context, p-values for individual coefficients are of little interest.

It's when you are interested in inference that p-values matter. This is a very difficult problem in LASSO or in any modeling approach that uses outcomes to select predictors. The usual assumptions for estimating p-values in standard regression models no longer hold when you have used the outcomes to select predictors. There has been some work on this in recent years, introduced for example in Chapters 16 and 20 of *Computer Age Statistical Inference*. Under some assumptions it is possible to estimate p-values, but I think that it's safe to say this is still an area of active research interest. Unless you are willing to get into these issues in depth, it might be best to stay away from p-values for individual coefficients in LASSO.

EdM
  • About the correlation of predictors: Meinshausen/Bühlmann claim that the subsetting of the sample helps with letting features "shine" even if they're correlated with more prominent features over the whole sample. Also, they add a randomized scaling to features in every run because of possibly correlated features, which I didn't explain here. – Edgar Apr 17 '19 at 14:55
  • @Edgar I agree that, insofar as it makes sense to evaluate importance among predictors in LASSO, the stability selection method is a promising approach. My reason for providing this answer is my fear that the OP or others who come upon this page might not have thought through what feature importance means in practice with multiple correlated predictors. – EdM Apr 17 '19 at 15:02
  • Can I ask what the parameter is that you said should be at least 0.6? Is it the lasso tuning parameter, or something I should calculate after the model is built? – Tiago Dias Apr 18 '19 at 08:18

If you want to assess the importance of features in the lasso framework, you can use stability selection by Meinshausen/Bühlmann. Basically, this means that you repeat your lasso $B$ times on a random subset of your data, and in every run you check which features are among the top $L$ chosen features. In the end you give every feature a score for how often it was selected in the top $L$ over the $B$ runs. The cited paper shows that stability selection is much more stable than the simple lasso.
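
A minimal Matlab sketch of the procedure, built on the built-in `lasso` (toy data stands in for your own; this simplified version omits the randomized feature re-weighting from the paper):

```matlab
% Stability selection sketch (simplified: no randomized re-weighting).
% Replace the toy data with your own X (n-by-p) and y.
rng(3); n = 200; p = 41;
X = randn(n, p);
y = X(:, 1:5) * [3; -2; 2; 1.5; -1] + randn(n, 1);

Bruns = 100;                                    % number of subsampling runs (B)
L = 5;                                          % features kept per run (q in the paper)
counts = zeros(p, 1);
for b = 1:Bruns
    idx = randperm(n, floor(n/2))';             % random half of the observations
    coefPath = lasso(X(idx, :), y(idx));        % p-by-numLambda path, lambda ascending
    chosen = [];
    for k = size(coefPath, 2):-1:1              % walk from the sparsest end of the path
        chosen = union(chosen, find(coefPath(:, k) ~= 0));
        if numel(chosen) >= L, break; end       % stop once the first L features have entered
    end
    counts(chosen) = counts(chosen) + 1;        % may slightly exceed L at a coarse grid step
end
score = counts / Bruns;                         % stability score per feature
important = find(score > 0.6)'                  % pi_thr in (0.6, 0.9) per the paper
```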

Edgar
  • Yes, I ran lasso for 30 runs, and for each run I ordered the absolute values of the regression coefficients in a ranking of 1 to 41 (41 predictors) to map them, but I was not certain whether that is a good methodology. I am also not sure whether I can calculate p-values for the regression coefficients in lasso. – Tiago Dias Apr 17 '19 at 13:53
  • $B=30$ is probably too low; $B=100$ or even $B=500$ should be better. The size of the subsample should be around half of the observations; if you don't have many observations you can choose to subsample somewhat more. $L=5$ is the most common choice, which makes lasso run quite fast. – Edgar Apr 17 '19 at 14:50
  • You don't have $p$-values this way, but the stability score should give you a good measure of how important the features are (Meinshausen and Bühlmann say that values above 0.6 are good indicators). Feel free to mark the answer of your choice as a solution to your problem! – Edgar Apr 17 '19 at 14:52
  • BTW you should definitely not order by the size of the regression coefficients (as @EdM pointed out, too). You should order by the entrance of features into the model with Lasso (Lasso adds features one by one, with stability selection, you stop after $L=5$ and go on to the next run). – Edgar Apr 17 '19 at 15:00
  • Thanks for the inputs; I will try to find out whether stability selection is something I can analyze in Matlab with its built-in functions. – Tiago Dias Apr 18 '19 at 08:08
  • Is the parameter that must be at least 0.6 the $\pi_{thr}$, or $\lambda$ (responsible for the regularization factor)? – Tiago Dias Apr 18 '19 at 08:36
  • $\pi_{thr}$ is the threshold for the stability score. $\alpha$ is the randomization level (in every run you scale features randomly to smaller values, with $\alpha=1$ meaning no randomization and smaller values meaning more randomization). $\lambda$ is the penalty parameter of the lasso regression that determines how many features are included in the model (smaller $\lambda$ meaning more features). As such, $\alpha$ needs to be specified in the beginning, $\lambda$ is a sequence of values that's either provided by you or determined by the implementation, $\pi_{thr}$ is assessed in the end. – Edgar Apr 18 '19 at 08:51
  • OK, but from the paper you provided I didn't get how to perform the $\pi_{thr}$ calculation for the model I obtained. – Tiago Dias Apr 18 '19 at 09:12
  • You don't necessarily need to calculate this parameter. Citing the paper: "The threshold value $\pi_{thr}$ is a tuning parameter whose influence is very small. For sensible values in the range of, say, $\pi_{thr} \in (0.6, 0.9)$, results tend to be very similar." – Edgar Apr 18 '19 at 09:16
  • Now I am confused. So when you said earlier to give a score whenever X1 is selected by lasso over the 100 runs, did you mean a scoring system of my choosing? – Tiago Dias Apr 18 '19 at 10:12
  • No. I repeat myself: lasso chooses features one after the other along a decreasing sequence of $\lambda$ values. With stability selection, you only consider the first $L$ chosen features ($q$ in the paper). This you do $B$ times. The score for feature $x_1$ is then $\#\{\text{runs in which } x_1 \text{ was among the first } L \text{ features}\}/B$. – Edgar Apr 18 '19 at 11:00
  • Maybe you should deepen your understanding of lasso first, before you study more complicated methods that improve on it. – Edgar Apr 18 '19 at 11:01
  • OK, I don't want to improve lasso. I just wanted something so I could say variable X1 is important, X2 is not, X3 is, etc. I guess I will just take the absolute values of the regression coefficients, since that's the standard way in the literature for my report, and they were obtained with standardized data. But thanks for your inputs. – Tiago Dias Apr 18 '19 at 13:37
  • This only works properly if your features aren't correlated, which they most certainly are. Can you give me a reference where it is stated in the literature that it's ok to take the absolute value of the coefficients? – Edgar Apr 18 '19 at 13:40