
I gathered from other posts that one cannot attribute 'importance' or 'significance' to the predictor variables that enter a lasso model, because calculating those variables' p-values or standard errors is still a work in progress.

Under that reasoning, is it correct to assert that one CANNOT say that variables that were EXCLUDED from the lasso model are 'irrelevant' or 'insignificant'?

If so, what can I actually claim about the variables that are either excluded from or included in a lasso model? In my specific case, I selected the tuning parameter lambda by repeating 10-fold cross-validation 100 times, in order to reduce randomness and to average the error curves.
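For reference, the lambda selection looked roughly like the sketch below (with `glmnet` shown as a stand-in for my actual fitting call, and `X` and `y` standing for my predictor matrix and binary outcome):

library(glmnet)

set.seed(1)
# fix a common lambda sequence so the 100 error curves can be averaged point-wise
lambda_grid <- glmnet(X, y, family = "binomial")$lambda
cv_curves   <- replicate(100, cv.glmnet(X, y, family = "binomial",
                                        lambda = lambda_grid, nfolds = 10)$cvm)
avg_curve   <- rowMeans(cv_curves)               # average the 100 cross-validation error curves
best_lambda <- lambda_grid[which.min(avg_curve)]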

UPDATE1: I followed a suggestion below and re-ran the lasso on bootstrap samples. I gave it a go with 100 samples (the amount my computing power could manage overnight) and some patterns emerged: 2 of my 41 variables entered the model more than 95% of the time, 3 variables more than 90%, and 5 variables more than 85%. Those 5 variables are among the 9 that entered the model when I ran it on the original sample, and they were the ones with the highest coefficient values then. (A sketch of the counting loop appears after the bullets below.) If I run the lasso with, say, 1000 bootstrap samples and those patterns hold, what would be the best way to present my results?

  • Does 1000 bootstrap samples sound enough? (My sample size is 116)

  • Should I list all the variables and how frequently they enter the model, and then argue that those that enter more frequently are more likely to be significant?

  • Is that as far as I can go with my claims? Since this is still a work in progress (see above), I cannot use a cut-off value, right?
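To make the counting concrete, the loop I ran looked roughly like this sketch (`glmnet` is again a stand-in for my actual fitting call; `X`, `y`, and the use of `lambda.min` are illustrative):

library(glmnet)

set.seed(1)
B <- 100
n <- nrow(X)
inclusion <- matrix(0, nrow = B, ncol = ncol(X),
                    dimnames = list(NULL, colnames(X)))

for (b in 1:B) {
  idx  <- sample(n, n, replace = TRUE)                      # resample rows with replacement
  cv   <- cv.glmnet(X[idx, ], y[idx], family = "binomial")  # refit, re-selecting lambda each time
  beta <- as.numeric(coef(cv, s = "lambda.min"))[-1]        # drop the intercept
  inclusion[b, ] <- as.numeric(beta != 0)                   # 1 if the variable entered this model
}

# proportion of bootstrap fits in which each variable was selected
sort(colMeans(inclusion), decreasing = TRUE)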

UPDATE2: Following a suggestion below, I have calculated the following: on average, 78% of the variables in the original model entered the models generated for the 100 bootstrap samples; the other way around, only 41%. This is due in large part to the fact that the models generated for the bootstrap samples tended to include many more variables (17 on average) than the original model (9).
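Those two overlap figures can be computed from the inclusion matrix in the sketch above, together with a vector `orig_vars` of the variable names selected on the full sample (both names are hypothetical):

orig_sel <- colnames(inclusion) %in% orig_vars

# average fraction of the original model's variables that re-enter each bootstrap model (the 78% figure)
mean(rowSums(inclusion[, orig_sel, drop = FALSE]) / sum(orig_sel))

# average fraction of each bootstrap model's variables that were also in the original model (the 41% figure)
mean(rowSums(inclusion[, orig_sel, drop = FALSE]) / pmax(rowSums(inclusion), 1))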

UPDATE3: If you could help me in interpreting the results I got from bootstrapping and Monte Carlo simulation, please have a look at this other post.

Puzzled

1 Answer


Your conclusion is correct. Think of two aspects:

  1. Statistical power to detect an effect. Unless the power is very high, one can miss even large real effects.
  2. Reliability: having a high probability of finding the right (true) features.

There are at least 4 major considerations:

  1. Is the method reproducible by you using the same dataset?
  2. Is the method reproducible by others using the same dataset?
  3. Are the results reproducible using other datasets?
  4. Is the result reliable?

When one desires to do more than predict, and to actually draw conclusions about which features are important in predicting the outcome, considerations 3 and 4 are crucial.

You have addressed 3. (and for this purpose, 100 bootstraps is sufficient), but in addition to individual feature inclusion fractions we need to know the average absolute 'distance' between a bootstrap feature set and the original selected feature set. For example, what is the average number of features detected from the whole sample that were found in the bootstrap sample? What is the average number of features selected from a bootstrap sample that were found in the original analysis? What is the proportion of times that a bootstrap found an exact match to the original feature set? What is the proportion that a bootstrap was within one feature of agreeing exactly with the original? Two features?
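As a rough sketch of these summaries, assuming the selected variable names from each bootstrap fit are stored in a list `boot_sets` and the variables selected on the full sample in a vector `orig_set` (both hypothetical names):

in_boot  <- sapply(boot_sets, function(s) sum(orig_set %in% s))   # original features recovered per bootstrap
in_orig  <- sapply(boot_sets, function(s) sum(s %in% orig_set))   # bootstrap features that were also original
set_diff <- sapply(boot_sets, function(s)
  length(union(orig_set, s)) - length(intersect(orig_set, s)))    # size of the symmetric set difference

mean(in_boot)         # average number of whole-sample features found in a bootstrap sample
mean(in_orig)         # average number of bootstrap features found in the original analysis
mean(set_diff == 0)   # proportion of exact matches to the original feature set
mean(set_diff <= 1)   # proportion within one feature of agreeing exactly
mean(set_diff <= 2)   # proportion within two features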

It would not be appropriate to say that any cutoff should be used in making an overall conclusion.

Regarding part 4., none of this addresses the reliability of the process, i.e., how close the feature set is to the 'true' feature set. To address that, you might do a Monte-Carlo re-simulation study where you take the original sample lasso result as the 'truth' and simulate new response vectors several hundred times using some assumed error structure. For each re-simulation you run the lasso on the original whole predictor matrix and the new response vector, and determine how close the selected lasso feature set is to the truth that you simulated from. Re-simulation conditions on the entire set of candidate predictors and uses coefficient estimates from the initially fitted model (and in the lasso case, the set of selected predictors) as a convenient 'truth' to simulate from. By using the original predictors one automatically gets a reasonable set of co-linearities built into the Monte Carlo simulation.

To simulate new realizations of $Y$ given the original $X$ matrix and the regression coefficients now treated as true, one can use the residual variance and assume normality with mean zero, or, to be even more empirical, save all the residuals from the original fit and take a bootstrap sample from them to add to the known linear predictor $X\beta$ for each simulation. Then the original modeling process is run from scratch (including selection of the optimum penalty) and a new model is developed. For each of 100 or so iterations, compare the new model to the true model you are simulating from.
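For a continuous outcome, simulating the new $Y$ vectors might look like the following sketch (with `X`, `beta_hat`, and `resid_orig` as hypothetical objects holding the original predictor matrix, the coefficients from the original fit, and its residuals):

lp <- as.vector(X %*% beta_hat)        # 'true' linear predictor (add the intercept if there is one)

# option 1: normal errors with the estimated residual standard deviation
y_new <- lp + rnorm(nrow(X), mean = 0, sd = sd(resid_orig))

# option 2: resample the observed residuals (more empirical)
y_new <- lp + sample(resid_orig, nrow(X), replace = TRUE)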

Again, this is a good check on the reliability of the process -- the ability to find the 'true' features and to get good estimates of $\beta$.

When $Y$ is binary, instead of dealing with residuals, re-simulation involves computing the linear predictor $X\beta$ from the original fit (e.g., using the lasso), taking the logistic transformation, and generating for each Monte Carlo simulation a new $Y$ vector to fit afresh. In R one can say for example

lp    <- predict(...)                      # linear predictor from the original fit (use a suitable predict method, or fitted())
probs <- plogis(lp)                        # inverse logit: convert the linear predictor to probabilities
y     <- ifelse(runif(n) <= probs, 1, 0)   # simulate a new binary response of length n
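Putting the pieces together, a minimal version of the re-simulation loop for the binary case might look like this (with `glmnet` standing in for whatever penalized fitting procedure is actually used, and `X`, `b0`, `beta_hat` as hypothetical objects from the original fit):

library(glmnet)

true_set <- which(beta_hat != 0)                   # the feature set treated as the 'truth'
probs    <- plogis(b0 + as.vector(X %*% beta_hat))
n        <- nrow(X)

nsim <- 100
recovered <- exact <- numeric(nsim)
for (i in 1:nsim) {
  y_new <- ifelse(runif(n) <= probs, 1, 0)             # fresh binary outcomes from the 'truth'
  cv    <- cv.glmnet(X, y_new, family = "binomial")    # rerun the whole process, incl. penalty selection
  sel   <- which(as.numeric(coef(cv, s = "lambda.min"))[-1] != 0)
  recovered[i] <- mean(true_set %in% sel)              # fraction of 'true' features found
  exact[i]     <- setequal(sel, true_set)              # did the selected set match the truth exactly?
}
mean(recovered); mean(exact)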
Frank Harrell
  • I'm editing my answer to reflect your new questions. – Frank Harrell Oct 09 '14 at 13:02
  • Not sure if I get what you mean by 'reproducible method' and 'unique result'. Would the first mean that, based on the description of my method, anyone could reproduce it? And the second, that my method yields one and only one possible result? – Puzzled Oct 09 '14 at 16:27
  • One basic concern: is bootstrap appropriate for my small sample size (roughly 3 observations per explanatory variable)? – Puzzled Oct 09 '14 at 19:44
  • The bootstrap will work OK - it doesn't depend on that - it mainly depends on overall sample size. I'm updating my original answer now. – Frank Harrell Oct 09 '14 at 20:31
  • I see... Even considering that my outcome variable is binary and that one of the classes has only 32 observations? – Puzzled Oct 09 '14 at 20:39
  • You should have stated that up front, and I should have asked. You are pushing the envelope far beyond what the available information will support. Think of it this way. For a binary outcome, in order to estimate only the intercept in a binary logistic model, you must have at least 96 observations. Then you need roughly 15 events per *candidate* predictor (if not penalizing). The likelihood of your process validating in a future dataset is fairly slim. Another way of looking at it is that all of this discussion is even more important (compared to having a larger $N$). – Frank Harrell Oct 09 '14 at 21:23
  • Not sure if I got that. So the bootstrapping approach is not adequate for my case? – Puzzled Oct 09 '14 at 21:52
  • I wasn't referring to bootstrapping. I was referring to whether you can learn anything from dozens of candidate variables when you only have 32 events. – Frank Harrell Oct 09 '14 at 23:53
  • Right. But I thought that lasso was appropriate for situations like that, was I mistaken? – Puzzled Oct 10 '14 at 00:02
  • Lasso is more appropriate than most methods but reliability goes down with such a small sample size. You are demanding parsimony by using lasso instead of a quadratic (ridge; L2) penalty. You will undoubtedly get better predictive discrimination by using a quadratic penalty and not asking for parsimony. Or do severe data reduction (masked to $Y$) then fit an unpenalized ordinary model. – Frank Harrell Oct 10 '14 at 02:05
  • I am still not sure how to interpret the results I got from the bootstrap, especially considering it would not be appropriate to use a cut-off value. Could I say that there is evidence that the original feature set is relevant in explaining the outcome (since ~7 out of the 9 variables entered the bootstrap-sample models on average, and considering individual feature inclusion fractions), but that prediction would likely be enhanced if additional variables were included (since the bootstrap-sample models tended to include 25 variables on average)? – Puzzled Oct 10 '14 at 13:31
  • As I understand it, there is a trade-off between predictability and model interpretability. I am willing to privilege the second, that is why I chose lasso. Would it be legitimate to use lasso and pose a warning that my results would be generalizable only to a limited extent (obs: I am in the social sciences and my observations are households in a particular rural setting)? – Puzzled Oct 10 '14 at 13:50
  • Do the re-simulation experiment I suggested to check the actual reliability of the method in your exact setting. – Frank Harrell Oct 10 '14 at 16:22
  • I am new to both bootstrap and Monte Carlo methods. How are the two different? I do not understand the rationale behind resimulating only the response vector (and not the matrix of predictors) as you suggest, can you explain it further? And what is the 'error structure' you mention? – Puzzled Oct 14 '14 at 12:46
  • See my expanded answer. – Frank Harrell Oct 14 '14 at 13:32
  • I am afraid the difference between the bootstrap and Monte Carlo is still not clear to me. I mean, isn't the bootstrap also measuring the reliability of the results and method? If so, what would the results of the Monte Carlo analysis add to that? If not, how can I interpret the 'distance between a bootstrap feature set and the original selected feature set' - if it is not measuring reliability, what is it measuring? – Puzzled Oct 14 '14 at 16:48
  • Please read carefully what I tried to explain above. The bootstrap can well describe the volatility of the model. But the bootstrap simulation is not informed of the true model so it cannot measure closeness to the truth as Monte Carlo simulation (which always starts with the 'truth') can. – Frank Harrell Oct 14 '14 at 16:49
  • After reading [this](http://www.stat.ufl.edu/archived/casella/Papers/BL-Final.pdf) article (pp. 375-377), suggested in [this](http://stats.stackexchange.com/questions/91462/standard-errors-for-lasso-prediction-using-r/91464#91464) post, I became uncertain about how acceptable it is to use the bootstrap to describe 'volatility'. According to the article, it would be questionable to use the bootstrap to estimate coefficients' standard errors (which, as I understand it, is a measure of 'volatility'). What do you think? – Puzzled Oct 14 '14 at 17:52
  • That nice paper makes some great points about difficulties of getting standard errors and confidence intervals for $\hat{\beta}$ in penalized situations, using the bootstrap. We have more success using the bootstrap for these purposes in unpenalized situations. But as far as assessing volatility of features selected, I think the bootstrap should do fine. I'm meaning volatility in the sense of getting different features selected using different samples. – Frank Harrell Oct 14 '14 at 18:00
  • How could I implement the Monte Carlo simulation in R, in terms of packages and commands? More specifically: how can I simulate the new Y vectors using residual variance? – Puzzled Oct 14 '14 at 18:20
  • p.s. I am using the `grpreg` package to run group lasso regularised logistic regression – Puzzled Oct 14 '14 at 18:42
  • I need to sign off this discussion - the basic answer to your question is basic R programming plus take a look at simple simulations in http://biostat.mc.vanderbilt.edu/rms. – Frank Harrell Oct 14 '14 at 18:48
  • One more thing, so that I can look into this in other sources: when you mention 'residual variance', you mean the variance of the original Y vector, right? If so, even considering that the Y vector is binary? – Puzzled Oct 14 '14 at 19:31
  • Last comment. I clarified this in my long answer above. – Frank Harrell Oct 14 '14 at 19:49
  • Can anyone suggest some references about the use of bootstrap and Monte Carlo in the context suggested in this answer (analysis of zero versus non-zero predictor coefficients rather than estimation of coefficients' standard deviation or the like)? – Puzzled Oct 15 '14 at 14:56
  • @Frank Harrell I love your answer very much! I just asked a question also about using lasso for feature selection [https://stats.stackexchange.com/questions/365938/what-features-are-selected-by-lasso], could you provide some hint at your convenience? – meTchaikovsky Sep 09 '18 at 02:31