
I have a set X of possible predictors for a response Y. In my case the number of predictors is much larger than the number of observations.

I have noticed in my runs of cv.glmnet (leave-one-out, all other parameters at their defaults) that if I try to predict using lambda.min, it simply returns the mean value of Y. If I run the prediction with a choice of lambda < lambda.min, it gives actual predictions, which have a lower error than using the mean value of Y.

I'm not sure what's going on here. It's as if the code is defaulting to a dummy predictor (the mean response) for some reason. It appears that this behavior is a function of the size of X.

Here's a simple example:

library(glmnet)

x=replicate(100,rnorm(10))
y=rnorm(10)
cvfit=cv.glmnet(x,y,nfolds=10)
ypred1=predict(cvfit,newx=x,s="lambda.min")

(In a case I just ran, this gave cvfit$lambda.min = 0.8453387, and all entries of ypred1 were the mean value of y. So, let's choose a different lambda:)

ypred2=predict(cvfit,newx=x,s=0.1)
mse1=mean((ypred1-y)^2)   # = 1.20
mse2=mean((ypred2-y)^2)   # = 0.03

I understand that "newx=x" doesn't make sense for any real work, but I don't understand why it returns the predictions it does.

David Marx
PickledZebra
  • Can you provide a reproducible example? Also, I believe it's actually generally recommended to use `lambda.1se` over `lambda.min` – David Marx Apr 12 '14 at 18:29
  • Thanks for the comment. I'm not sure how to do this. In my case, the size of X (number of potential predictors) is > 500. Also, lambda.1se = lambda.min when I see this happen. – PickledZebra Apr 12 '14 at 18:53
  • 1
    I strongly suspect there's something wonky with your code or the data you are feeding your model. `lambda.1se` should be larger than `lambda.min`. Without access to your code and/or data that reproduces the issue, I don't really know what to tell you. – David Marx Apr 12 '14 at 19:19
  • I've tried to create an example and edited the original post. – PickledZebra Apr 12 '14 at 19:32
  • `lambda.1se == lambda.min` tells you something is going wrong. Don't rely on the default lambda sequence, pass in your own one: `cv.glmnet(..., lambda=10^(seq(m,-n,-0.2)))` Also, show us your plot: `plot.cv.glmnet(cvfit, sign.lambda=-1)` – smci Feb 10 '17 at 14:27
  • 1
    Try to cook up a more meaningful testcase than x,y using `rnorm()`, even if you did `set.seed()`. You wouldn't expect y ~ x. Pick one of the builtin R datasets. – smci Feb 10 '17 at 14:50
  • A better-illustrated example of what looks like the same phenomenon appears at https://stats.stackexchange.com/questions/319861/how-to-interpret-lasso-shrinking-all-coefficients-to-0/319897#319897. – whuber Jan 20 '18 at 18:05
  • Is plot.cv.glmnet a function? It doesn't come in my package. Plot does. – Emmanuel Goldstein Jun 16 '21 at 16:21

3 Answers


Here, glmnet is working as intended! In your example, there is no relationship between $x$ and $y$ (both were generated independently). So the "correct" thing to do is to always predict $\hat{y} = \bar{y}$. Any method that isn't doing that is overfitting.
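A minimal sketch of this (the seed and dimensions are hypothetical, mirroring the question's setup): with independent x and y, the model selected at lambda.min typically keeps only the intercept, so every prediction collapses to (essentially) the mean of y.

library(glmnet)
set.seed(1)   # hypothetical seed, just for reproducibility
x=replicate(100,rnorm(10))
y=rnorm(10)
cvfit=cv.glmnet(x,y,nfolds=10)
b=as.numeric(coef(cvfit,s="lambda.min"))
sum(b[-1]!=0)   # typically 0 here: every slope has been zeroed out
unique(as.vector(predict(cvfit,newx=x,s="lambda.min")))   # a single value, ~ mean(y)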

Stefan Wager
  • I understand that. But in the case I show, glmnet did find a fit; it can overfit even if the two sets were independently generated. I don't understand how the algorithm determines that the model it generated was overfit and decides to do the "correct" thing. – PickledZebra Apr 12 '14 at 20:09

This is simply a consequence of how glmnet generates its default sequence of lambdas. It calculates the maximum lambda in its grid search from your data: it is essentially the minimal lambda that sets all the (non-intercept) coefficients in your model equal to zero.

The details are in the glmnet paper (Friedman, Hastie & Tibshirani, 2010, "Regularization Paths for Generalized Linear Models via Coordinate Descent"), but the rub is that:

$$ N \alpha \lambda_{\max} = \max_l \left| \langle x_l, y \rangle \right| $$

Since the correct answer in your case is to zero out all (non-intercept) coefficients, I would guess it is quite likely that cv.glmnet is returning this $\lambda_{\max}$ as optimal.
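As a rough numerical check of that formula (a sketch: it assumes the default standardize=TRUE, so the inner products use the columns of x scaled to unit 1/N variance; since those columns are also centered, whether y is centered does not change the result):

library(glmnet)
set.seed(1)
n=10; p=100
x=matrix(rnorm(n*p),n,p)
y=rnorm(n)
sd_n=apply(x,2,function(v) sqrt(mean((v-mean(v))^2)))   # 1/N-style column sds
xs=scale(x,center=TRUE,scale=sd_n)
lambda_max=max(abs(crossprod(xs,y)))/n   # alpha = 1, the default (lasso)
c(by_hand=lambda_max, from_glmnet=max(glmnet(x,y)$lambda))   # should agree, up to glmnet's internal conventions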

Matthew Drury
  • This is missing the important part: *"Never rely on the "default" lambda sequence, always supply your own one"* – smci Feb 24 '18 at 05:48

lambda.1se == lambda.min

"All entries in ypred1 are the mean value of y"

Both of these tell you that your coefficients got zeroed out. (You should always inspect the coefficients, with coef(cvfit, s='lambda.1se').) Either lambda was too large, or your x variables weren't normalized. Or they simply aren't predictive of y.

Never rely on the "default" lambda sequence; always supply your own:

cv.glmnet(..., lambda=10^(seq(m,-n,-0.2)))

which sweeps lambda geometrically from 10^m down to 10^(-n). You choose m and n to be large enough that this range covers all plausible optimal values of lambda.

So rerun with that, then pick the right lambda. Also, show us your deviance plot: plot(cvfit, sign.lambda=-1). There's supposed to be a knee in it; yours apparently doesn't have one.

If you double-checked all the above, then your x variables simply aren't predictive of y (unless you made a gross error and scrambled x and y).
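Putting this together on the question's toy data (a sketch; m = 2 and n = 4 are arbitrary choices here, wide enough to bracket both the question's lambda.min of ~0.85 and the hand-picked 0.1):

library(glmnet)
set.seed(1)   # hypothetical seed
x=replicate(100,rnorm(10))
y=rnorm(10)
lambda_grid=10^(seq(2,-4,-0.2))   # m = 2, n = 4: from 100 down to 1e-4, decreasing as glmnet expects
cvfit=cv.glmnet(x,y,lambda=lambda_grid)
plot(cvfit,sign.lambda=-1)   # look for a knee in the deviance curve
coef(cvfit,s="lambda.1se")   # inspect which coefficients survived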

smci
  • 2
    Could you extend this statement "Never rely on the "default" lambda sequence, always supply your own one" ? – rook1996 Jun 11 '18 at 10:44
  • @rook1996: I already do: I show a user-defined lambda sequence, `lambda=10^(seq(m,-n,-0.2))` – smci Jun 11 '18 at 22:03
  • 4
    No I mean, why should I supply my own Lambda sequence. What is wrong with the calculation per default Lambda ? Do you have some sources where this argument is stated ? And which values should I choose for m and n – rook1996 Jun 11 '18 at 22:17
  • Sorry, what is `m` in the above sequence? – AW27 Feb 12 '21 at 08:18
  • @AW27: lambda is a geometric sequence from 10^m ... 10^(-n), per my edit. That's what `lambda=10^(seq(m,-n,-0.2))` does. (In R, we can apply 10^ to a vector, and the result is a vector; see the tutorials.) – smci Feb 13 '21 at 00:00
  • Thanks, but I thought you had a particular number in mind. After looking around I've found that you can select any number for `m` and `n`. – AW27 Feb 13 '21 at 08:18
  • 1
    @AW27: no, that's why they're parameterized `m` and `n`. The whole advice here is there's no right parameter knowable in advance, so you test over a wide range of `m,...,(-n)` to find the right value for your data. – smci Feb 13 '21 at 09:27
  • 3
    @smci in your comment you state "Never rely on the "default" lambda sequence, always supply your own one", where and why is there evidence that supports this? [glmnet documentation](https://cran.r-project.org/web/packages/glmnet/vignettes/glmnet.pdf) doesn't necessarily state that so I was wondering if there was documentation elsewhere. – AW27 Feb 17 '21 at 11:11
  • @AW27: this is the emphatic advice from every ML expert/course/coder I've dealt with. Why they've never fixed the glmnet doc on this is beyond me; by all means look at or file a docbug for them. – smci Feb 17 '21 at 22:10
  • 1
    "this is the emphatic advice from every ML expert/course/coder". However, it would be important to state the why, and not just they advice against it because... just because. I like doing things rationally, not just because I heard some people do it. – Emmanuel Goldstein Jun 16 '21 at 16:30
  • @EmmanuelGoldstein: I already stated the why, above, months ago. [Feb 13]: *"so you test over a wide range of m,...,(-n) to find the right value for your data."* I also edited the answer to be even more explicit. – smci Jun 16 '21 at 21:19
  • 3
    The main issue I think everyone is having with the "just because" selection of lambda in your example is that you can't guarantee that your sequence with whatever ``m`` and ``n`` will cover the necessary range of possible ``lambda`` values. In literature, however, maximum value is well [documented](https://stats.stackexchange.com/a/174907/87362), as for the minimum value, don't forget that glmnet also implements various heuristic rules, e.g., [Tibshirani et. al.](https://statweb.stanford.edu/~tibs/ftp/strong.pdf), among other. The idea is to stop _before_ reaching the minimum value – runr Jun 16 '21 at 21:46
  • @runr: No, the idea is to continue somewhat past the minimum value, then retrospectively determine your `lambda.min` and `lambda.1se`. (Otherwise how would you know you hadn't just hit a local minimum?) I haven't looked at glmnet's source code, but in multiple courses I was taught we easily saw how glmnet's default lambda sequence often failed to find the optimal lambda. This was between 2012-2016, so unless the code has improved radically, that would still be true. – smci Jun 17 '21 at 01:22
  • @EmmanuelGoldstein: I saw with my own eyes how the default lambda sequence routinely failed to find optimal values, on various datasets. That's as rational as it gets. Your comment is extremely unconstructive, the more so because it had already been answered years/months previously. – smci Jun 17 '21 at 01:25
  • 1
    Apologies for the unconstructedness of my comment. My point is that I don't know what type of datasets you routinely work with that this phenomenon occurs. It may not be in all datasets in the universe. I just need a mathematical description. – Emmanuel Goldstein Jun 17 '21 at 08:56
  • @EmmanuelGoldstein: simple, 20-year-old datasets, the ones commonly used in teaching data science, exactly like my previous comments implied. Nothing fancy or corner-case. I don't have a mathematical description. When all the practitioners I learn from explicitly warn me this is a known issue with glmnet, and prove it multiple times, I accept it as proven fact. By the way, `10^(seq(m,-n,-0.2))` slices the lambda sequence with 5 points per decade, and IIRC glmnet's default sequence didn't do more than 2 per decade, another reason it missed optima, beyond the range m, n alone. – smci Jun 17 '21 at 09:18
  • @smci I think you could put large parts of this discussion to rest if you edited your answer to include such a data set, or a link to it, with a worked example of how this phenomenon manifests. Naturally, a demonstration on one data set is not sufficient to demonstrate that the `glmnet` defaults are always defective, or that one must always use a user-provided interval, but this demonstration would provide a way to verify your claim. Alternatively, a citation to an academic article or other resource could also provide support to the claim. – Sycorax Jun 17 '21 at 15:08
  • 2
    @EmmanuelGoldstein You might consider asking a question about the suitability of the `glmnet` defaults. There's no need to mention this answer or users in this thread -- just state in specific but neutral terms that you had someone tell you about this deficiency and you want to know if there is an academic reference that discusses it in detail, a mathematical demonstration, etc. – Sycorax Jun 17 '21 at 16:53
  • Thanks, I will do this. – Emmanuel Goldstein Jun 17 '21 at 20:46
  • @Sycorax and EmmanuelGoldstein: yes, please ask it as a separate question. I'm not the glmnet maintainers and I'm not Tibshirani et al. If I had anything more concrete I would have posted it 5 years ago already. – smci Jun 17 '21 at 22:06