
In the opening lines of Akaike's original AIC paper, "Information Theory and an Extension of the Maximum Likelihood Principle" (1973), there is the following statement:

[screenshot of the statement from the paper, with the expectation symbol underlined in red]

I'm wondering whether the expectation symbol underlined in red is necessary, i.e. whether it adds anything I'm not aware of. As far as I can tell, this is just a statement of the traditional cross-entropy term from information theory:

\begin{equation} \mathbb{E}_{X\sim P(X)}[\log Q(X)] = \int_{\mathcal{X}} p(x)\log q(x) dx, \end{equation}

Under this definition the additional $\mathbb{E}[\cdot]$ does not need to be there, or am I missing something subtle?
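To make sure I'm reading the identity the right way, here is a quick numerical sanity check I put together (a minimal sketch of my own; the two Gaussians are arbitrary choices for $P$ and $Q$, not anything from the paper):

```python
# Sanity check: E_{X~P}[log Q(X)] should match the integral of p(x) log q(x) dx.
# P and Q are arbitrary Gaussians chosen only for illustration.
import numpy as np
from scipy import stats, integrate

rng = np.random.default_rng(0)
p = stats.norm(loc=0.0, scale=1.0)   # P(X)
q = stats.norm(loc=0.5, scale=1.5)   # Q(X)

# Left-hand side: Monte Carlo estimate of E_{X~P}[log q(X)]
x = p.rvs(size=1_000_000, random_state=rng)
lhs = np.mean(q.logpdf(x))

# Right-hand side: numerical integration of p(x) log q(x)
rhs, _ = integrate.quad(lambda t: p.pdf(t) * q.logpdf(t), -np.inf, np.inf)

print(lhs, rhs)   # the two values agree up to Monte Carlo error
```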

develarist
tisPrimeTime

2 Answers


I've thought about this a lot (I am the OP), and from my reading I've come to the following conclusion.

Consider a random variable $X$ (the observed data) distributed according to the density $f(x\mid\theta_0)$. Here we assume the data are generated by a parametric family (which justifies the conditioning), and in particular that the samples of $X$ are generated under a true parameter $\theta_0$.

Similarly, consider a new random variable $Z$ for some new data, distributed according to the density $f(z\mid \theta_0)$ (and thus generated by the same underlying process as $X$). We will use this $Z$ to form an estimator of $\theta_0$, namely $\hat{\theta}(Z)$, written $\hat{\theta}$ for short. Moreover, take $X$ and $Z$ to be independent draws.

Akaike works with the following expected log-likelihood, $\mathbb{E}_{(X,Z)}[\log f(X\mid \hat{\theta})] = \mathbb{E}_{Z}\big[\mathbb{E}_{X}[\log f(X\mid \hat{\theta})]\big]$ (law of iterated expectations, using $X\perp Z$). Remembering that $X$ has associated density $f(x \mid \theta_0)$, we thus arrive at:

\begin{equation} \mathbb{E}_{(X,Z)}[\log f(X\mid \hat{\theta})] = \mathbb{E}_{Z}\left[\int_{\mathcal{X}} f(x \mid \theta_0)\log f(x\mid \hat{\theta})\,dx \right] \end{equation}

Indeed this acts like a form of "mean cross-entropy", where the outer expectation averages over the randomness of the estimator $\hat{\theta}(Z)$. The purpose of AIC is to (approximately) maximize this expected log-likelihood, equivalently to minimize the corresponding discrepancy, whereas the naive MLE simply selects $\hat{\theta} = \arg\max_{\theta} f(z\mid \theta)$ and evaluates the fit on the same data $Z$ used for estimation.

Indeed it was very subtle!
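To convince myself, I also ran a tiny simulation of this two-sample picture (a rough sketch of my own, using a unit-variance Gaussian mean model; none of the names or choices here come from the paper):

```python
# Z is used to fit theta_hat, X plays the role of fresh data from the same
# process, and everything is averaged over many draws of Z.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, n, reps = 0.0, 20, 5_000

in_sample, out_sample = [], []
for _ in range(reps):
    z = rng.normal(theta0, 1.0, size=n)      # data used for estimation
    theta_hat = z.mean()                     # MLE of the mean (variance known)
    # naive (in-sample) average log-likelihood: evaluates f(z | theta_hat)
    in_sample.append(stats.norm(theta_hat, 1.0).logpdf(z).mean())
    # E_X[log f(X | theta_hat)]: closed form for a unit-variance Gaussian
    out_sample.append(-0.5 * np.log(2 * np.pi) - 0.5 * (1 + (theta_hat - theta0) ** 2))

print(np.mean(in_sample))    # optimistic: evaluated on the same data as theta_hat
print(np.mean(out_sample))   # E_Z E_X[log f(X | theta_hat)], the quantity above
print(np.mean(in_sample) - np.mean(out_sample))  # gap is roughly k/n = 1/20 here
```

The gap between the two averages is roughly $k/n$ per observation (here $k = 1$ parameter), which is exactly the kind of optimism the AIC penalty term is designed to correct.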

I also like the answer linked below. It has a lot of parallel intuition regarding the two datasets mentioned here. The extension to the train/test split is particularly nice, as is how AIC tries to approximate it within a single sample:

AIC with test data, is it possible?

tisPrimeTime

I think you are correct: strictly speaking there should be no $\mathbb{E}$ before the integral term (which, by the way, also appears in equation 1.2, but not in equation 2.1). One way to see this is the following: the best estimate $\hat{\theta}$ of the ground-truth value $\theta$ is the one that minimizes the Kullback-Leibler divergence between the estimated distribution $f(x\mid\hat{\theta})$ and the true distribution $f(x\mid\theta)$: $$ D_{KL} = \int f(x\mid\theta) \log f(x\mid\theta)\,dx - \int f(x\mid\theta) \log f(x\mid\hat{\theta})\,dx $$

The second integral on the right-hand side is the only one that depends on $\hat{\theta}$, and it must be maximized in order to minimize $D_{KL}$, hence the statement prior to equation 1.1.

But adding the $\mathbb{E}$ symbol is not wrong in itself. Indeed, the integral $\int f(x\mid\theta) \log f(x\mid\hat{\theta})\,dx$ is a function of $\hat{\theta}$ and $\theta$ only (not of $x$), so averaging it over $x \sim f(x\mid\theta)$ leaves it unchanged.
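As a quick numerical illustration of this argument (a sketch of my own, with a unit-variance Gaussian family chosen purely for convenience; $\theta$ is the true mean and $\hat{\theta}$ a candidate value):

```python
# The cross-entropy term is a single number once theta and theta_hat are fixed,
# and it is largest (D_KL = 0) when theta_hat equals theta.
import numpy as np
from scipy import stats, integrate

theta = 0.0
f_true = stats.norm(theta, 1.0)

def cross_entropy_term(theta_hat):
    # integral of f(x|theta) log f(x|theta_hat) dx -- a scalar, not a function of x
    f_hat = stats.norm(theta_hat, 1.0)
    val, _ = integrate.quad(lambda x: f_true.pdf(x) * f_hat.logpdf(x), -np.inf, np.inf)
    return val

# first term of D_KL: integral of f(x|theta) log f(x|theta) dx
entropy_term, _ = integrate.quad(lambda x: f_true.pdf(x) * f_true.logpdf(x), -np.inf, np.inf)

for theta_hat in [-1.0, -0.5, 0.0, 0.5, 1.0]:
    ce = cross_entropy_term(theta_hat)
    print(theta_hat, round(ce, 4), round(entropy_term - ce, 4))
# maximizing the cross-entropy term is the same as minimizing the divergence
```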

I would be happy to see if someone has a different opinion on this.

Camille Gontier