My goal is to calculate various information criteria (e.g., the AIC) for generalised linear models. To do this, we need the effective degrees of freedom of the trained model. In an unregularised model this is typically taken to be the number of parameters, but it is not clear to me how to handle the case where the model is regularised.
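Concretely, I want to plug the effective degrees of freedom $\mathrm{df}$ into the standard formula

$$\mathrm{AIC} = 2\,\mathrm{df} - 2\log\hat{L},$$

where $\hat{L}$ is the maximised likelihood; in the unregularised case $\mathrm{df}$ is just the number of parameters.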
For a Gaussian noise model this seems to be well studied, notably by Hui Zou. For a lasso model, it appears that we take the number of non-zero coefficients, which is shown to be an unbiased estimator of the degrees of freedom. For ridge regression, the degrees of freedom can be estimated by the trace of the smoother matrix $S = X(X^TX + \lambda I)^{-1}X^T$, i.e. $\mathrm{df} = \operatorname{tr}(S)$. Zou shows that a combination of these two approaches can be used for the elastic net.
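For concreteness, here is a minimal numpy sketch of the three Gaussian-case estimators as I understand them (the function names and the `tol` threshold for deciding which coefficients count as non-zero are my own choices):

```python
import numpy as np

def df_ridge(X, lam):
    """Ridge df: trace of the smoother S = X (X'X + lam*I)^{-1} X'."""
    p = X.shape[1]
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return np.trace(S)

def df_lasso(beta_hat, tol=1e-10):
    """Lasso df: the number of non-zero coefficients
    (an unbiased estimator of the degrees of freedom)."""
    return int(np.sum(np.abs(beta_hat) > tol))

def df_enet(X, beta_hat, lam2, tol=1e-10):
    """Elastic-net df, as I understand Zou's result: the ridge-type
    trace computed on the active set only, with the L2 penalty lam2."""
    active = np.abs(beta_hat) > tol
    XA = X[:, active]
    k = XA.shape[1]
    if k == 0:
        return 0.0
    S = XA @ np.linalg.solve(XA.T @ XA + lam2 * np.eye(k), XA.T)
    return np.trace(S)
```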
My question is: can these results be generalised (and if so, how) to an arbitrary GLM, i.e. not only to models that minimise the squared loss? I would assume (and Park provides additional evidence for this) that the same approach as above can be used for L1-regularised GLMs. It is not clear to me how to generalise the results for ridge or elastic net regression.
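To make the question concrete, the only generalisation I can think of for the ridge case is to replace the Gaussian trace with its IRLS-weighted analogue at convergence, along the lines of the sketch below. I have no justification for this beyond analogy, which is exactly what I am asking about:

```python
def df_ridge_glm(X, w, lam):
    """Speculative: trace of X (X'WX + lam*I)^{-1} X'W, where W = diag(w)
    holds the IRLS weights from the final iteration of fitting the
    penalised GLM. Whether this carries the same justification as in
    the Gaussian case is what I would like to know."""
    p = X.shape[1]
    XtW = X.T * w                      # X' W, with w the 1-D weight vector
    S = X @ np.linalg.solve(XtW @ X + lam * np.eye(p), XtW)
    return np.trace(S)
```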
I also want to note that this question does not duplicate either this Stack Overflow post or this one, both of which use only the number of parameters in the fitted model to calculate the degrees of freedom (and thus, I think, entirely ignore the effect of the L2 regularisation).