
I will use an elastic net to estimate a regression model that will later be used for forecasting.

I have a grid of $\alpha$ values within [0,1] representing the proportion of $L_1$ versus $L_2$ penalty.
I also have a grid of $\lambda$ values for the amount of penalization.

There are at least two alternatives for selecting the optimal combination $(\alpha,\lambda)$:

  1. Perform leave-one-out cross validation (LOOCV) to see which combination $(\alpha,\lambda)$ delivers the lowest MSE on the validation sets (and maybe use the one-sigma rule towards parsimony).
  2. Use the whole sample to see which combination $(\alpha,\lambda)$ delivers the lowest AIC.
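Alternative 1 could be sketched roughly as follows in Python with scikit-learn (the simulated data and the grids are purely illustrative; note that sklearn's `l1_ratio` plays the role of $\alpha$ here, while its `alpha` argument is the $\lambda$):

```python
# Alternative 1: exhaustive (alpha, lambda) grid search scored by LOOCV MSE.
# Data and grids are illustrative only.
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

# sklearn parameterization: l1_ratio is the L1-vs-L2 mix (the alpha in the
# question), and alpha is the overall penalty strength (the lambda).
param_grid = {
    "l1_ratio": np.linspace(0.1, 1.0, 5),
    "alpha": np.logspace(-3, 1, 10),
}
search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
print(search.best_params_)  # combination with the lowest LOOCV MSE
```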

In the second alternative, the degrees of freedom used in AIC would be based on the effective degrees of freedom of an elastic net. (I suppose the latter should be possible to obtain as the effective degrees of freedom are known for both LASSO and ridge regression.)
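To make alternative 2 concrete, here is a rough sketch assuming Gaussian errors, where the effective degrees of freedom are approximated by a ridge-type trace formula applied to the active set (I treat that formula as an approximation rather than an exact result, and I ignore the intercept in the df calculation):

```python
# Alternative 2: select (alpha, lambda) by AIC, using an approximate
# effective degrees of freedom for the elastic net. The df formula is the
# ridge-type trace restricted to the active set; treat it as a heuristic.
import numpy as np
from sklearn.linear_model import ElasticNet

def effective_df(X, lam, alpha_mix, coef):
    """Approximate df: tr(X_A (X_A'X_A + n*lam*(1-alpha_mix) I)^{-1} X_A')."""
    active = np.flatnonzero(coef)
    if active.size == 0:
        return 0.0
    XA = X[:, active]
    n = X.shape[0]
    ridge = n * lam * (1.0 - alpha_mix)  # matches sklearn's penalty scaling
    H = XA @ np.linalg.solve(XA.T @ XA + ridge * np.eye(active.size), XA.T)
    return float(np.trace(H))

def aic(X, y, lam, alpha_mix):
    """Gaussian AIC up to constants: n*log(RSS/n) + 2*df."""
    fit = ElasticNet(alpha=lam, l1_ratio=alpha_mix, max_iter=10_000).fit(X, y)
    rss = np.sum((y - fit.predict(X)) ** 2)
    n = len(y)
    return n * np.log(rss / n) + 2 * effective_df(X, lam, alpha_mix, fit.coef_)

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=60)

grid = [(lam, a) for lam in np.logspace(-3, 1, 10)
                 for a in np.linspace(0.1, 1.0, 5)]
best = min(grid, key=lambda p: aic(X, y, *p))
print(best)  # (lambda, alpha) pair minimizing the approximate AIC
```

Note this uses the whole sample once per grid point, so it is far cheaper than LOOCV, which is the speed argument below.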

Question: Which of 1. and 2. is better and why?

Some thoughts:

  • In the context of feature selection, LOOCV is known to be asymptotically equivalent to AIC-based selection. So asymptotically I would expect both 1. and 2. to yield the same result. But what about finite samples?
  • Alternative 2. could be preferred due to speed.
  • Alternative 2. requires specifying the error distribution.
  • Is it fine to use effective degrees of freedom when calculating AIC?

Here are a couple of related questions: this and this.

Richard Hardy
  • If the model is to be used for forecasting, why are you interested in parsimony rather than simply using ridge regression and using information from all the predictors? – EdM Oct 05 '15 at 14:53
  • Indeed, I am interested in forecasting, and parsimony per se plays no role. Still, when it comes to forecasting ridge regression does not beat LASSO or elastic net by design, does it? So therefore I opt for elastic net which has the flexibility to choose the data-based balance between ridge and LASSO penalties. If it was the one-sigma rule that caught your attention, I just thought it is a standard so why not give it a try. On the other hand, it might not make much sense to be conservative when parsimony is not a goal I am seeking, so I might just give it up. – Richard Hardy Oct 05 '15 at 18:12
  • Apparently, @FrankHarrell has a number of answers mentioning (successful) use of effective AIC (e.g. [this](http://stats.stackexchange.com/questions/26528/how-to-estimate-shrinkage-parameter-in-lasso-or-ridge-regression-with-50k-varia/26751#26751)), i.e. AIC calculated using effective degrees of freedom. So if I understand it correctly, selecting $\lambda$ using effective AIC can be a good idea. – Richard Hardy Oct 11 '15 at 19:41
  • +1. But cross-validation does not necessarily mean leave-one-out. LOOCV is generally believed to have high variance, and the usual recommendation is to use something like 10-fold CV instead. This is not asymptotically equivalent to AIC anymore (and I am not sure exactly what the conditions are under which LOOCV is asymptotically equivalent to AIC). What I see in the machine learning community is that people tend to use cross-validation as the method of choice. – amoeba Oct 20 '15 at 22:37
  • I think in general practitioners will fix `alpha=0.5` when fitting an elastic net model, and use cross-validation only to select $\lambda$. Searching over a grid can lead to overfitting even if you're using cross-validation. Nevertheless, there is an interval-search algorithm implemented in the [c060](http://www.jstatsoft.org/article/view/v062i05/v62i05.pdf) package that will select the optimal parameter combination. – user230309 Oct 11 '15 at 16:13
  • Thanks for the link to c060! I am not sure about overfitting, though. See the argumentation [here](http://stats.stackexchange.com/questions/173647/grid-fineness-and-overfitting-using-regularization-lasso-ridge-elastic-net), lines 5-8 of the answer. The argument is intended for $\lambda$ but could perhaps be applied to $\alpha$, too. Anyhow, your answer could better serve as a comment, as it does not address the question I have asked. – Richard Hardy Oct 11 '15 at 19:42
  • Your argument would be compounded with both penalties. You should ask yourself the question "why am I trying to select the optimal combination and what benefits are there over the common choice of 0.5 for alpha". I cannot comment because I don't have a rep of 50 – user230309 Oct 28 '15 at 23:06

0 Answers