
In 2013, @Donbeo asked whether there were any theoretical results supporting the use of Cross Validation to choose the lasso penalty, and was scolded in the comments for asking "a pretty generic question about generalization error and empirical risk minimization." Well, I think it was a good question.

[Image: the penalized least squares (lasso) equation; picture from Zou's paper referenced by @Edgar in his answer]

I know it wouldn't work out well to try to estimate $\lambda$ in a frequentist maximum likelihood setting. If I had to propose why, I'd say there are problems with identifiability. But if that's true, then there must be some magical property of Cross Validation (or Empirical Risk Minimization in general) that allows one to estimate it without making any other assumptions. I would appreciate any thoughts on this.
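To make the puzzle concrete, here's a quick simulated sketch (an illustration only, using scikit-learn, which calls the penalty `alpha` and scales the objective slightly differently): if I pick $\lambda$ by minimizing the in-sample error, nothing stops me from driving it toward zero, while the cross-validated error typically bottoms out at a strictly positive value.

```python
# Illustration only: compare in-sample MSE with 5-fold CV MSE across a
# grid of penalties.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 1.0                      # sparse true coefficients
y = X @ beta_true + rng.normal(size=n)

for lam in [0.001, 0.01, 0.1, 0.5, 1.0]:
    model = Lasso(alpha=lam, max_iter=50_000)
    train_mse = np.mean((y - model.fit(X, y).predict(X)) ** 2)
    cv_mse = -cross_val_score(model, X, y, cv=5,
                              scoring="neg_mean_squared_error").mean()
    print(f"lambda={lam:5.3f}  train MSE={train_mse:.3f}  CV MSE={cv_mse:.3f}")

# The in-sample column only gets smaller as lambda shrinks, so fitting
# lambda to the training data would push it to zero; the CV column is
# typically minimized at some lambda > 0.
```

At least in a toy setup like this, the training error tells me nothing about where to stop shrinking $\lambda$, which is the heart of what I'm asking.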

Most of all, I'd like an explanation of what types of parameters Cross-Validation is, in general, better suited to estimating than traditional inference, and some rationale as to why.

P.S. This post is an interesting read about CV as it relates to empirical Bayes, but it focuses more on CV's ability to counteract model misspecification.

Ben Ogorek
  • $\lambda$ isn't a parameter of the statistical model (that is, it does not describe the DGP, the data-generating process); it is a parameter of the estimation method, a *hyperparameter*. Maximum likelihood cannot be used to estimate parameters that do not describe the DGP. – kjetil b halvorsen Sep 30 '20 at 13:39
  • @kjetilbhalvorsen thank you for that distinction. What about empirical Bayes methods that are able to estimate the amount of regularization? – Ben Ogorek Sep 30 '20 at 13:42
  • Just thinking further on this, I guess it's not about whether the method is empirical Bayes but whether the shrinkage is built into the model itself, as in a hierarchical modeling setting. Still curious about why trying to find lambda with RMSE, for instance, would clearly fail. – Ben Ogorek Sep 30 '20 at 16:03
  • I'm not sure, but I think it has to do with cross-validation simulating out-of-sample validation, so it does not depend on an assumption that the model is the truth ... – kjetil b halvorsen Sep 30 '20 at 16:22

1 Answer


We don't generally consider $\lambda$ as a parameter in the model you want to estimate. It doesn't have an interpretation outside of the model, in terms of your actual data. Instead, we consider $\lambda$ as a tuning parameter or a hyperparameter. This terminology means that $\lambda$ affects how you estimate $\beta$, but you aren't interested in $\lambda$ itself. Every value of $\lambda$ produces a unique estimate of $\beta$, which I'll denote with $\hat{\beta}_\lambda$.

So the penalized least squares equation you posted describes a huge set of possible estimates of $\beta$. You have to choose the "best" estimate, $\hat{\beta}_\lambda$, according to some criterion (best prediction, best model fit, etc.). That's where cross-validation comes in: you fit $\hat{\beta}_\lambda$ on part of your dataset and evaluate that criterion on the remaining portion.
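To make that selection step concrete, here's a minimal sketch with scikit-learn's `LassoCV` (one possible tool; it calls the penalty `alpha`): for each candidate $\lambda$ it fits $\hat{\beta}_\lambda$ on the training folds, scores it on the held-out fold, and keeps the $\lambda$ with the best average held-out error.

```python
# Minimal sketch: choose the lasso penalty by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:4] = [2.0, -1.5, 1.0, 0.5]    # only 4 truly nonzero coefficients
y = X @ beta_true + rng.normal(size=n)

# LassoCV fits beta_hat_lambda over a grid of penalties on the training
# folds and keeps the penalty with the smallest average held-out MSE.
model = LassoCV(cv=5, random_state=0).fit(X, y)
print("chosen penalty:", model.alpha_)
print("indices of nonzero coefficients:", np.flatnonzero(model.coef_))
```

Note that $\hat{\beta}_\lambda$ itself is what you report; the chosen $\lambda$ is just a by-product of the tuning procedure.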

Eli
  • Great answer and this does indeed answer the question. One thing that's still on my mind, and perhaps out of scope, is why minimizing the RMSE with lambda and all the DGP parameters would clearly fail, but you're able to do it with cross validation. I'll leave this question open for a few days. – Ben Ogorek Sep 30 '20 at 15:58
  • I don't know what DGP is short for. I don't think minimizing RMSE or any other loss metric would fail in any way. You use cross-validation to mimic having an independent dataset for validation. This prevents overfitting. – Eli Sep 30 '20 at 19:01
  • DGP - "data generating process." You addressed the DGP by saying $\lambda$ "doesn't have an interpretation...in terms of your data". On minimizing RMSE, I see that you're saying the optimization would probably work, just overfit like crazy. Makes sense. – Ben Ogorek Oct 01 '20 at 16:04
  • Yes, that's correct. – Eli Oct 01 '20 at 16:53