8

I wonder about

  • the optimal grid fineness and
  • what the relation between grid fineness and overfitting is

in regularization methods such as LASSO, ridge regression or elastic net.

Suppose I want to fit a regression model using LASSO to a sample of 500 observations (I do not have the data; this is just an example). Suppose also that I have
(A) a grid with 100 different $\lambda$ values in the range between $\lambda_{min}$ and $\lambda_{max}$
(B) a grid with 1000 different $\lambda$ values in the same range,
where $\lambda$ is the parameter controlling the degree of penalization.

Questions:

  1. Can I say something about the propensity to overfit in (A) versus (B)?
  2. Can I determine the optimal grid fineness? How?
Richard Hardy
  • 54,375
  • 10
  • 95
  • 219

1 Answer

7

Can I say something about the propensity to overfit in (A) versus (B)?

Provided that both grids cover a sufficient range, grid fineness doesn't really have anything to do with overfitting in this problem (though a coarse grid might underfit if it skips over a profitable interval). It's not as if testing more values will somehow change what out-of-sample performance looks like.* For these penalized regressions, we want to optimize the penalized likelihood over values of $\lambda$, and for a fixed data set and a fixed partitioning, the out-of-sample metric at each $\lambda$ is entirely deterministic; how many values of $\lambda$ you test does not alter it. A coarser grid might mean that you skip over the absolute minimum of your out-of-sample metric, but finding the absolute minimum probably isn't desirable in the first place: hyperparameters tend to be poorly estimated, and in finite samples the noise in that estimate will overwhelm small changes in the distance between adjacent grid points. The standard error of your estimate will tend to swamp differences in grid fineness.
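As a quick sanity check, here's a sketch on simulated data (not the poster's actual setting; the grid construction loosely mimics glmnet's log-scale default) sampling the same cross-validation curve on a 100-point and a 1000-point grid. The minimum of the CV error curve barely moves:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Toy data standing in for the 500-observation example in the question.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)

# Smallest penalty that zeroes all lasso coefficients (Gaussian case).
lam_max = np.max(np.abs(X.T @ y)) / n

best = {}
for k in (100, 1000):
    alphas = np.logspace(np.log10(lam_max), np.log10(1e-3 * lam_max), k)
    fit = LassoCV(alphas=alphas, cv=5).fit(X, y)
    # mse_path_ is the deterministic out-of-sample curve sampled on this grid.
    best[k] = fit.mse_path_.mean(axis=1).min()

print(best)  # the minimum CV error is nearly identical on both grids
```

The finer grid samples the same smooth curve more densely; it doesn't make the selected model any more (or less) prone to overfitting.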

If you're truly concerned that the out-of-sample performance metric might be overly optimistic, you could adopt the one-standard-error rule, which picks the most regularized model whose CV error is within one standard error of the minimum. That way you're being slightly more conservative and choosing a less complex model.
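A minimal sketch of that rule (the helper name `one_se_rule` is my own; it assumes you already have a matrix of per-fold CV errors, e.g. `mse_path_` from scikit-learn's `LassoCV`):

```python
import numpy as np

def one_se_rule(alphas, cv_errors):
    """Pick the most-penalized alpha whose mean CV error is within one
    standard error of the minimum. cv_errors: shape (n_alphas, n_folds)."""
    alphas = np.asarray(alphas)
    cv_errors = np.asarray(cv_errors)
    mean = cv_errors.mean(axis=1)
    se = cv_errors.std(axis=1, ddof=1) / np.sqrt(cv_errors.shape[1])
    i_min = mean.argmin()
    threshold = mean[i_min] + se[i_min]
    # Among alphas meeting the threshold, take the largest (most regularized).
    return alphas[mean <= threshold].max()
```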

Can I determine the optimal grid fineness? How?

The LARS algorithm does not define a priori which values of $\lambda$ to check; rather, $\lambda$ is varied continuously and the algorithm detects the values at which a coefficient moves from zero to a nonzero value. Those knot values of $\lambda$ are retained, and because lasso coefficient paths are piecewise linear between knots, storing only the knots loses no information in that case. LARS only works when the coefficient paths are piecewise linear, though. The ridge penalty never shrinks a coefficient to exactly zero, so all of its coefficient paths are smooth and everywhere nonzero; the same goes for elastic net regressions (except those that are also lasso regressions).
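To see those knots concretely, scikit-learn's `lars_path` returns exactly the $\lambda$ values at which the active set changes (a sketch on simulated data):

```python
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p = 500, 10
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(size=n)

# `alphas` holds only the knots where a variable enters (or, in the lasso
# variant, leaves) the model; paths are piecewise linear between them.
alphas, active, coefs = lars_path(X, y, method="lasso")
print(alphas)   # decreasing sequence of knot values
print(active)   # order in which variables entered the active set
```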

But most people use GLMNET because it's often faster. To determine what grid of $\lambda$ to search over, I recommend reading the GLMNET article, "Regularization Paths for Generalized Linear Models via Coordinate Descent" by Jerome Friedman, Trevor Hastie, and Rob Tibshirani. In it, they develop a very efficient coordinate-descent algorithm for estimating ridge, lasso and elastic net regressions. The algorithm finds the value $\lambda_\text{max}$ at which $\beta$ is the zero vector, sets a minimum value $\lambda_\text{min}$ as a fixed fraction of $\lambda_\text{max}$, and generates a sequence of values between the two, uniform on the log scale. This grid is sufficient for most purposes, though it gives up the property that you know precisely where a coefficient becomes nonzero. Warm starts make the solutions along the path much cheaper to compute, and the implementation supports many common GLMs.
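The grid construction itself is easy to reproduce. A sketch for the Gaussian case (the function name `glmnet_grid` is my own, and I'm assuming `X` and `y` are centered/standardized as glmnet does internally; the ratio `eps` and `n_lambda=100` are in the ballpark of glmnet's defaults):

```python
import numpy as np

def glmnet_grid(X, y, n_lambda=100, eps=1e-4):
    """Log-uniform lambda grid in the style of glmnet (Gaussian case).

    lambda_max is the smallest penalty at which every lasso coefficient
    is zero; lambda_min is a fixed fraction eps of lambda_max.
    """
    n = X.shape[0]
    lam_max = np.max(np.abs(X.T @ y)) / n
    return np.logspace(np.log10(lam_max), np.log10(eps * lam_max), n_lambda)
```

Everything about the grid is then pinned down by three choices: the largest value, the min/max ratio, and the number of points.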


*You might be thinking about this from the perspective of an artificial neural network, where early stopping is sometimes used to accomplish regularization, but that's an entirely unrelated problem (namely, that the optimization algorithm is prevented from reaching an optimum, so the model is forced to be less complex).

Sycorax
  • 76,417
  • 20
  • 189
  • 313
  • 2
I don't think you're quite right in the description of how glmnet chooses the lambdas, user777. Check out section 2.5 in the paper, where they discuss the choice of minimum and maximum lambda, and those in-between. You may be thinking of LARS, which indeed does what you describe, but I don't believe has been generalized to include a ridge penalty. – Matthew Drury Sep 22 '15 at 13:48
  • @MatthewDrury Bah. You're right. I was thinking of LARS. – Sycorax Sep 22 '15 at 13:49
  • I have read some related material and perhaps that paper, too, but the following was not completely convincing for me: *Finally, they generate a sequence of values between the two uniformly on the log scale.* Is there some justification showing that this is an optimal choice? Also, how do they choose the fineness of the grid? I don't remember having read a good explanation. – Richard Hardy Sep 22 '15 at 14:22
The values are generated on the log scale because $\lambda$ is a scale parameter. (Recall that penalized regressions correspond to MAP estimates of Bayesian regressions with particular prior distributions, and that there is a correspondence between the scale parameter of the prior and $\lambda$.) The notion is that scale parameters only vary meaningfully on the log scale. – Sycorax Sep 22 '15 at 14:25
  • OK, my trouble wasn't really with the log scale but with its fineness -- but thanks for the explanation! And as a side question (which I might need to post explicitly), what about the fineness of the grid balancing the $L_1$ versus $L_2$ penalty in elastic net? Is that associated with overfitting? – Richard Hardy Sep 22 '15 at 14:27
  • For non-lasso problems, there is no optimal choice because coefficient paths are continuous, so in terms of fineness, you'll always be making a tradeoff between knowing more about coefficient paths and doing more computation. But this question seems underdetermined, since you haven't specified what optimal means here. – Sycorax Sep 22 '15 at 14:29
  • To answer your question, though, I think I can save you the trouble -- I'm not aware of a problem for which the fineness of the hyperparameter grid is the source of overfitting. – Sycorax Sep 22 '15 at 14:31
  • 1
    I've observed in all my uses of glmnet that the change in log likelihood between consecutive grid points is always dominated by the estimated std-error of said estimates. So the standard grid is fine enough that any information obtained from an increased resolution would be dominated by uncertainty in the lambda estimate. – Matthew Drury Sep 22 '15 at 14:45
Thanks, guys! I will take some time to digest your answers. Regularization is not part of my everyday work, so I may be slow in grasping the essence. – Richard Hardy Sep 22 '15 at 14:54
  • I disagree with the first paragraph. It seems that the same argument that you are making would apply not only to trying out different values of lambda but also to trying out different models altogether, perhaps different sets of predictors etc. This is clearly wrong: imagine trying out 1 mln models, having only 500 observations. You will overfit, even though you are using cross-validation. DikranMarsupial calls it "overfitting the model selection criterion" (see Cawley & Talbot (2010) *On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation*). – amoeba Mar 08 '17 at 11:24
  • [cont.] However, even given that, I still believe that having a finer grid for lambdas would not overfit. But the reason for that must be more subtle. – amoeba Mar 08 '17 at 11:25
  • 1
@amoeba The small-sample case would be swamped by the variance in the CV estimates: any $\lambda$ in the vicinity of the minimum would perform essentially the same. This is why there's no real payoff to increased grid fineness. Also, the $\lambda$ trajectories are usually nice curves, so increasing grid fineness just populates the space between estimates. In examples I've seen, the response curve doesn't dramatically swing up and down, especially not within some fine interval. – Sycorax Mar 08 '17 at 15:25
  • It is possible to over-fit the model selection criterion by over-optimising it. This is unlikely to be a problem for just one regularisation problem if the dataset is reasonably large. I wrote a paper demonstrating this for LS-SVM but it was rejected for not being interesting enough, even though a statistically significant deterioration in generalisation was demonstrated and the paper was arguing for computationally cheap sparse grid search. – Dikran Marsupial Aug 21 '21 at 08:41