
On page 231 of The Elements of Statistical Learning, AIC is defined as follows in (7.30):

Given a set of models $f_\alpha(x)$ indexed by a tuning parameter $\alpha$, denote by $\overline{err}(\alpha)$ and $d(\alpha)$ the training error and number of parameters for each model. Then for a set of models we define

$$AIC(\alpha) = \overline{err}(\alpha) + 2 \cdot \frac{d(\alpha)}{N}\hat{\sigma_\epsilon}^2$$

Where $\overline{err}$, the training error, is $\frac{1}{N}\sum_{i=1}^NL(y_i,\hat{f}(x_i))$.

On the same page, it is stated in (7.29) that

For the logistic regression model, using the binomial log-likelihood, we have

$$AIC = -\frac{2}{N} \cdot \text{loglik} + 2 \cdot \frac{d}{N}$$

where "$\text{loglik}$" is the maximised log-likelihood.

The book also mentions that $\hat{\sigma_\epsilon}^2$ is an estimate of the noise variance, obtained from the mean-squared error of a low-bias model.

It is not clear to me how the first equation leads to the second in the case of logistic regression. In particular, what happens to the $\hat{\sigma_\epsilon}^2$ term?

Edit: I found that in a later example in the book (on page 241) the authors use AIC and say

For misclassification error we used $\hat{\sigma_{\epsilon}}^2=[N/(N-d)] \cdot \overline{err}(\alpha)$ for the least restrictive model ...

This doesn't answer my question, as it doesn't link the two aforementioned expressions for AIC, but it does seem to indicate that $\hat{\sigma_{\epsilon}}^2$ is not simply set to $1$ as stated in Demetri's answer.

Seraf Fej

1 Answer


I imagine they made an approximation. $\sigma^2_\epsilon$ is the residual variance of the outcome conditioned on the covariates $x_i$. When the outcome is binary, as in logistic regression, $\sigma^2_\epsilon < 1$.

When we compare models with AIC, only the differences in AIC between models matter, so using the approximation $\sigma^2_\epsilon=1$ for all models isn't so offensive. Let me demonstrate:

$$\Delta AIC = AIC_1 - AIC_2 = \dfrac{-2}{N}(\text{loglik}_1 - \text{loglik}_2) + \dfrac{2}{N}(d_1 - d_2) $$

Because we assumed that $\sigma^2_\epsilon$ was the same for each model (namely, that it was $1$), it factors out of the difference between the models' penalty terms. Setting $\sigma^2_\epsilon=1$ isn't arbitrary; it is an upper bound on the variance of a binary variable. The least upper bound would be $0.25$, and it isn't quite clear to me why that wasn't chosen, but again the choice of $\sigma^2_\epsilon$ seems only to affect the AIC values themselves and not the differences between model AICs, which is what we're really after.
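As a quick numerical illustration of (7.29), here is a minimal sketch comparing two nested logistic models. The synthetic data, the hand-rolled Newton-Raphson fitter `fit_logistic`, and all variable names are my own, not from the book:

```python
import numpy as np

def fit_logistic(X, y, n_iter=50):
    """Fit logistic regression by Newton-Raphson; return (beta, loglik)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        # Newton step: beta += (X' W X)^{-1} X' (y - p)
        beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, loglik

def aic(loglik, d, N):
    # Equation (7.29): AIC = -2/N * loglik + 2 * d/N
    return -2.0 / N * loglik + 2.0 * d / N

rng = np.random.default_rng(0)
N = 500
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + 1.0 * x1))))

X1 = np.column_stack([np.ones(N), x1])       # model 1: intercept + x1
X2 = np.column_stack([np.ones(N), x1, x2])   # model 2: adds an irrelevant x2

_, ll1 = fit_logistic(X1, y)
_, ll2 = fit_logistic(X2, y)

# Delta AIC as in the display above: -2/N*(loglik_1 - loglik_2) + 2/N*(d_1 - d_2)
delta = aic(ll1, 2, N) - aic(ll2, 3, N)
```

Note that whatever common value of $\sigma^2_\epsilon$ we plugged in would rescale only the $2(d_1-d_2)/N$ part of `delta`, since the log-likelihood term carries no $\sigma^2_\epsilon$ factor.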

Demetri Pananos
  • Hi @DemetriPananos, maybe I am missing a detail here, but it seems to me that arbitrarily setting $\hat{\sigma_\epsilon}^2$ could, in fact, have an effect on the absolute difference between models, since it is only multiplied by the second term. Is there something I am missing here? – Seraf Fej May 08 '20 at 14:27
  • @SerafFej See my edit. – Demetri Pananos May 08 '20 at 15:30
  • Thanks for the clarification. Would it not be fair to say, however, that this choice $\hat{\sigma_\epsilon}^2 = 1$ still affects which model has lower AIC? In your example, imagine the case where $\text{loglik}_1 - \text{loglik}_2 = 0.5$, $d_1 - d_2 = 1$ and $N = 2$ (these are just random numbers to make the point). Then we obtain that the delta is positive and therefore we should use model 2. However, if we had chosen $\hat{\sigma_\epsilon}^2 = 0.25$ instead, then the delta would be negative, so we should use model 1. This would seem to contradict your point? – Seraf Fej May 08 '20 at 16:00