I've thought about this a lot (I am OP), and through my readings I've come to the following conclusion:
Consider a RV $X$ distributed according to a density $f(x\mid\theta_0)$ (the observed data). Here we assume the data are generated by a parametric family (which justifies the presence of the conditioning), and in particular that samples of $X$ are generated under a true parameter $\theta_0$.
Similarly, consider a new RV $Z$ for some new data, distributed according to the density $f(z\mid \theta_0)$ (and thus generated by the same underlying process as $X$). We will use this $Z$ to form an estimator of $\theta_0$, namely $\hat{\theta}(Z)$, denoted $\hat{\theta}$ for short. Moreover, take $X$ and $Z$ to be separate (i.e. independent) draws.
Akaike works with the following expected log-likelihood: $\mathbb{E}_{(X,Z)}[\log f(X\mid \hat{\theta})] = \mathbb{E}_{Z}\big[\mathbb{E}_{X}[\log f(X\mid \hat{\theta})]\big]$ (by the law of iterated expectations, where $X\perp Z$ lets the inner expectation be taken over the marginal of $X$). Remembering that $X$ has associated density $f(x \mid \theta_0)$, we thus arrive at:
\begin{equation}
\mathbb{E}_{(X,Z)}[\log f(X\mid \hat{\theta})] = \mathbb{E}_{Z}\left[\int_{\mathcal{X}} f(x \mid \theta_0)\log f(x\mid \hat{\theta}) \, dx \right]
\end{equation}
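To make the double expectation concrete, here is a small worked example (my own, not part of Akaike's derivation): take $Z_1,\dots,Z_n \overset{\text{iid}}{\sim} N(\theta_0, 1)$ with known variance, let $\hat{\theta} = \bar{Z}$ be the MLE, and let $X \sim N(\theta_0, 1)$ be a single new observation. Then
\begin{equation}
\mathbb{E}_{X}[\log f(X\mid \hat{\theta})] = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\,\mathbb{E}_{X}\!\left[(X-\hat{\theta})^2\right] = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\left(1 + (\hat{\theta}-\theta_0)^2\right),
\end{equation}
and since $\mathbb{E}_{Z}[(\hat{\theta}-\theta_0)^2] = 1/n$,
\begin{equation}
\mathbb{E}_{(X,Z)}[\log f(X\mid \hat{\theta})] = -\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\left(1 + \tfrac{1}{n}\right).
\end{equation}
The expected in-sample value, by contrast, is $-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\left(1 - \tfrac{1}{n}\right)$, so the gap is $1/n$ per observation, i.e. $k/n$ with $k=1$ parameter.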
Indeed, the inner integral acts like a (negative) cross entropy, and the outer expectation averages the estimator $\hat{\theta}$ over the randomness of $Z$. The goal is to pick the model that maximizes this expected log-likelihood (equivalently, minimizes the expected KL discrepancy between $f(\cdot\mid\theta_0)$ and $f(\cdot\mid\hat{\theta})$); AIC estimates it, because the naive approach of evaluating the log-likelihood at the MLE $\hat{\theta} = \arg\max_{\theta} f(z\mid\theta)$ on the very same data $z$ that produced it is optimistically biased.
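To see that optimism numerically, here is a short Monte Carlo sketch (my own illustration, not Akaike's derivation), assuming a Gaussian model with unknown mean and standard deviation ($k = 2$ parameters); all names and values below are illustrative:

```python
# Monte Carlo sketch: Gaussian model with unknown mean and sd (k = 2), showing
# that the in-sample log-likelihood at the MLE overshoots the expected
# out-of-sample log-likelihood by roughly k -- the bias the AIC penalty corrects.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
theta0_mean, theta0_sd = 0.0, 1.0      # true parameters (illustrative values)
n, k, n_rep = 50, 2, 10_000            # sample size, #parameters, replications

in_sample, out_sample = [], []
for _ in range(n_rep):
    z = rng.normal(theta0_mean, theta0_sd, size=n)   # data Z used to fit
    mu_hat, sd_hat = z.mean(), z.std()               # MLEs (ddof=0)
    # naive quantity: total log-likelihood of Z evaluated at theta_hat(Z)
    in_sample.append(norm.logpdf(z, mu_hat, sd_hat).sum())
    # target quantity: n * E_X[log f(X | theta_hat)], estimated with fresh X
    x = rng.normal(theta0_mean, theta0_sd, size=n)
    out_sample.append(norm.logpdf(x, mu_hat, sd_hat).sum())

optimism = np.mean(in_sample) - np.mean(out_sample)
print(f"estimated optimism ~ {optimism:.2f}; AIC penalizes with k = {k}")
```

With these settings the estimated optimism comes out close to $k$ (a bit above it in small samples, which is what AICc adjusts for), matching the $2k$ term in $\mathrm{AIC} = -2\log L(\hat{\theta}) + 2k$.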
Indeed it was very subtle!
I also like the answer linked below. It has a lot of parallel intuition regarding the two datasets mentioned here. The extension to the train/test split is particularly nice, as is how AIC tries to approximate this within a single sample:
AIC with test data, is it possible?