
I consider two models $M_0$ and $M_1$, with $M_1$ more complicated than $M_0$ in the sense that it has more parameters (I usually assume that $M_0$ is nested within $M_1$). They are parametrized by $\theta_0$ and $\theta_1$ respectively. I assume that

  1. $\theta_0 \subset \theta_1$ (i.e. $M_1$ has the same parameters as $M_0$ plus extra parameters)
  2. $p(\theta_0|M_1) = p(\theta_0|M_0)$ (both models have the same priors for the parameters they have in common)

I would like to prove the following inequality:

$$\forall \theta_0, \quad \langle \log p(\mathcal{D} | M_0) \rangle _{p(\mathcal{D} | \theta_0, M_0)} \geq \langle \log p(\mathcal{D} | M_1) \rangle _{p(\mathcal{D} | \theta_0, M_0)}$$

i.e. that, on average, if my data $\mathcal{D}$ are generated from $M_0$ parametrized by a given $\theta_0$, then the Bayes factor will favor $M_0$ over $M_1$.

Has this already been done? Intuitively, it is an application of Occam's razor (a simpler, true model will be favored over a more complicated one), but I lack a formal proof.

A clarification on notation: $p(\mathcal{D}|M_0,\theta_0)$ is not the same as $p(\mathcal{D}|M_0)$, so I cannot simply invoke the non-negativity of the Kullback-Leibler divergence. In "$M_0,\theta_0$", I specify both the model and its parameters; in "$M_0$", I specify only the model. $p(\mathcal{D}|M_0,\theta_0)$ is the probability that the data $\mathcal{D}$ are generated from model $M_0$ with parameters $\theta_0$, while $p(\mathcal{D}|M_0)$ is the marginal likelihood over all parameters (the one used to compute the Bayes factor): $p(\mathcal{D}|M_0) = \int p(\mathcal{D}|M_0,\theta)\,p(\theta|M_0)\,d\theta$, where $p(\theta|M_0)$ is the prior over the parameters under model $M_0$.
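To make the claim concrete, here is a small numerical sanity check (not a proof): a toy nested Gaussian linear model in which $M_0$ has only an intercept and $M_1$ additionally has a coefficient for a known covariate, with the same $N(0, \tau^2)$ prior on every coefficient, so that both assumptions above hold. The model, the prior variance and all numerical values are illustrative choices on my part.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

T = 20                    # number of observations per dataset (illustrative)
tau2 = 4.0                # shared prior variance of the regression coefficients
mu_star = 1.3             # true value of the common parameter under M0
z = rng.normal(size=T)    # known covariate, used only by M1

# Design matrices: M0 has only an intercept, M1 adds the covariate z
X0 = np.ones((T, 1))
X1 = np.column_stack([np.ones(T), z])

def log_evidence(x, X):
    """Closed-form log marginal likelihood of the Gaussian linear model
    x = X theta + noise, noise ~ N(0, I), prior theta ~ N(0, tau2 * I),
    i.e. the log-density of x under N(0, I + tau2 * X X^T)."""
    cov = np.eye(T) + tau2 * X @ X.T
    return multivariate_normal(mean=np.zeros(T), cov=cov).logpdf(x)

# Monte Carlo estimate of < log p(D | M_i) > under p(D | theta_0*, M_0)
n_rep = 5000
acc0 = acc1 = 0.0
for _ in range(n_rep):
    x = mu_star + rng.normal(size=T)   # dataset generated from M0 at theta_0*
    acc0 += log_evidence(x, X0)
    acc1 += log_evidence(x, X1)

print("E[log p(D|M0)] ~", acc0 / n_rep)
print("E[log p(D|M1)] ~", acc1 / n_rep)   # conjectured to be the smaller of the two
```

With these settings, the Monte Carlo average for $M_0$ should come out above the one for $M_1$, which is exactly the inequality I would like to prove in general.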

Camille Gontier
  • If I understand your notation correctly, the difference between the two expressions would be the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Definition) or "relative entropy" between the models (for any fixed values of $\theta_0$ and $\theta_1$). Your inequality appears to be [Gibbs' Inequality](https://en.wikipedia.org/wiki/Gibbs%27_inequality). – whuber Feb 12 '20 at 13:48
  • Sadly, it is not. $p(\mathcal{D}|M_0,\theta_0)$ is not the same as $p(\mathcal{D}|M_0)$, and I thus cannot use the positivity of the Kullback-Leibler divergence. – Camille Gontier Feb 12 '20 at 17:47
  • Then could you please clarify the difference between "$M_0,\theta_0$" and "$M_0$"? – whuber Feb 12 '20 at 18:31
  • In "$M_0,\theta_0$", I specify both the model and its parameters. In "$M_0$", I only specify the model. $p(\mathcal{D}|M_0,\theta_0)$ is the probability that the data $\mathcal{D}$ are generated from model $M_0$ with parameters $\theta_0$, while $p(\mathcal{D}|M_0)$ is the marginal likelihood over all parameters (the one we use to compute the Bayes factor): $p(\mathcal{D}|M_0) = \int p(\mathcal{D}|M_0,\theta)\,p(\theta|M_0)\,d\theta$, where $p(\theta|M_0)$ is the prior of the parameters under model $M_0$. – Camille Gontier Feb 12 '20 at 18:38
  • Thank you. Because this is your first mention of a prior distribution over the parameters, it would be best to make that explicit in your post. Currently, the notation is sufficiently compact that it leaves too much up to the interpretation of each reader. – whuber Feb 12 '20 at 18:40
  • This cannot be proven true because there seems to be no guarantee that $p(\theta|M_0)$ won't assign $0$ probability to $\theta_0$ while $p(\theta|M_1)$ assigns non-zero probability to $\theta_0$. – Cagdas Ozgenc Feb 12 '20 at 19:38
  • The result cannot hold in general, as it depends on the choice of the priors over both models. As an extreme example, take priors degenerate at $\theta_0$. – Xi'an Feb 13 '20 at 07:59
  • @CagdasOzgenc an implicit assumption is that $p(\theta_0|M_0)$ is not zero. Since the data $\mathcal{D}$ were generated from $M_0$ parametrized by $\theta_0$, $\theta_0$ is a possible value for the parameters. – Camille Gontier Feb 13 '20 at 09:17
  • @Xi'an I indeed realize that model complexity depends not only on the number of parameters, but also on their priors. For instance, choosing a Dirac delta centered on a certain value as the prior of a parameter over-simplifies the model. I clarified my question (see the two assumptions at the beginning) to assume that both models have the same priors for the parameters they have in common. – Camille Gontier Feb 13 '20 at 09:26

1 Answer


Here is my attempt at answering the question:

Proposition: Let $\mathcal{M}_0$ and $\mathcal{M}_1$ be two nested models such that $\mathcal{M}_0 \preceq \mathcal{M}_1$. We denote by $\Theta_0$ and $\Theta_1$ the spaces of possible parameters for $\mathcal{M}_0$ and $\mathcal{M}_1$, with $\Theta_0 \subset \Theta_1$. If the data generated from $\mathcal{M}_0$ and $\mathcal{M}_1$ are IID, then the following inequality holds $\forall \theta_0^* \in \Theta_0$:

\begin{equation} \label{eq:proposition1} \langle \log p(\mathcal{D}|\mathcal{M}_0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \geq \langle \log p(\mathcal{D}|\mathcal{M}_1) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \end{equation}

If the data are not IID, a sufficient condition for the inequality to hold is

\begin{equation} \label{eq:condition1} k_{\mathcal{M}_0} \log (2 \pi) - \sum_{i=1}^{k_{\mathcal{M}_0}} \langle \log (\lambda_{i}^0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \geq k_{\mathcal{M}_1} \log (2 \pi) - \sum_{i=1}^{k_{\mathcal{M}_1}} \langle \log (\lambda_{i}^1) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \end{equation}

where

$k_{\mathcal{M}_0}$ and $k_{\mathcal{M}_1}$ are the numbers of independent parameters of $\mathcal{M}_0$ and $\mathcal{M}_1$, respectively;

$H_0(\hat{\theta}_0)$ and $H_1(\hat{\theta}_1)$ are the Hessian matrices of the log-likelihoods $\log p(\mathcal{D}|\theta_0,\mathcal{M}_0)$ and $\log p(\mathcal{D}|\theta_1,\mathcal{M}_1)$, evaluated at the respective MLEs $\hat{\theta}_0$ and $\hat{\theta}_1$;

$\{\lambda^0_i\}_{1 \leq i \leq k_{\mathcal{M}_0}}$ and $\{\lambda^1_i\}_{1 \leq i \leq k_{\mathcal{M}_1}}$ are the respective eigenvalues of $-H_0(\hat{\theta}_0)$ and $-H_1(\hat{\theta}_1)$.

Proof: Using the same (Laplace-type) approximation as in the derivation of the BIC for $p(\mathcal{D}|\mathcal{M}_0)$ and $p(\mathcal{D}|\mathcal{M}_1)$ yields

\begin{gather} \log p(\mathcal{D}|\mathcal{M}_0) = \log p(\mathcal{D}|\hat{\theta}_0,\mathcal{M}_0) + \log \pi(\hat{\theta}_0|\mathcal{M}_0)+ \frac{k_{\mathcal{M}_0}}{2} \log (2 \pi) - \frac{1}{2} \log (|-H_0(\hat{\theta}_0)|)\\ \log p(\mathcal{D}|\mathcal{M}_1) = \log p(\mathcal{D}|\hat{\theta}_1,\mathcal{M}_1) + \log \pi(\hat{\theta}_1|\mathcal{M}_1)+ \frac{k_{\mathcal{M}_1}}{2} \log (2 \pi) - \frac{1}{2} \log (|-H_1(\hat{\theta}_1)|) \end{gather}
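As a quick aside, here is a small numerical sanity check of this Laplace-type approximation (it is not part of the proof): a one-parameter conjugate Gaussian model for which the exact log marginal likelihood is available in closed form, so the right-hand side above can be compared against it. The model, the prior and all numbers are illustrative choices on my part.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

T, tau2 = 50, 4.0                            # illustrative sample size and prior variance
x = rng.normal(loc=0.7, scale=1.0, size=T)   # toy data: IID N(mu, 1) with true mu = 0.7

def loglik(mu):
    # log p(D | mu, M): IID N(mu, 1) observations
    return np.sum(norm.logpdf(x, loc=mu, scale=1.0))

def logprior(mu):
    # prior pi(mu | M) = N(0, tau2)
    return norm.logpdf(mu, loc=0.0, scale=np.sqrt(tau2))

# MLE by numerical optimization (here it is simply the sample mean)
mu_hat = minimize_scalar(lambda m: -loglik(m)).x

# Second derivative (1x1 Hessian) of the log-likelihood at the MLE, by central differences
h = 1e-4
H = (loglik(mu_hat + h) - 2 * loglik(mu_hat) + loglik(mu_hat - h)) / h**2

k = 1  # one free parameter
laplace = (loglik(mu_hat) + logprior(mu_hat)
           + 0.5 * k * np.log(2 * np.pi) - 0.5 * np.log(-H))

# Exact log evidence for this conjugate model: D ~ N(0, I + tau2 * 1 1^T)
exact = multivariate_normal(mean=np.zeros(T), cov=np.eye(T) + tau2).logpdf(x)

print("Laplace/BIC-style approximation:", laplace)
print("Exact log marginal likelihood:  ", exact)
```

For moderate $T$ the two values should agree closely, since the error of this approximation comes from treating the prior as roughly constant around $\hat{\theta}$.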

Both quantities then need to be averaged with respect to $p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)$, i.e. under $\langle \cdot \rangle_{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)}$. Assuming

\begin{equation} \langle \log p(\mathcal{D}|\hat{\theta}_0, \mathcal{M}_0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \approx \langle \log p(\mathcal{D}|{\theta}_0^*, \mathcal{M}_0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \end{equation}

(i.e. that the maximum likelihood estimator $\hat{\theta}_0$ will be close to the true value $\theta_0^*$ from which the data were generated) yields, by Gibbs' inequality, $\langle \log p(\mathcal{D}|\hat{\theta}_0, \mathcal{M}_0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} \geq \langle \log p(\mathcal{D}|\hat{\theta}_1, \mathcal{M}_1) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)}$. Furthermore, $k_{\mathcal{M}_0} \leq k_{\mathcal{M}_1}$ yields $\pi(\hat{\theta}_0|\mathcal{M}_0) \geq \pi(\hat{\theta}_0|\mathcal{M}_1)$ (these quantities do not depend on $\mathcal{D}$). The inequality is thus satisfied for the first two terms on the right-hand side.
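To spell out the Gibbs' inequality step: for any fixed probability density $q(\mathcal{D})$,

\begin{equation} \langle \log p(\mathcal{D}|\theta_0^*,\mathcal{M}_0) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} - \langle \log q(\mathcal{D}) \rangle _{p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)} = D_{\mathrm{KL}}\big(p(\cdot|\theta_0^*,\mathcal{M}_0) \,\|\, q\big) \geq 0 \end{equation}

which is applied here with $q(\mathcal{D}) = p(\mathcal{D}|\hat{\theta}_1,\mathcal{M}_1)$. Note that this involves the additional approximation of treating $\hat{\theta}_1$ as fixed, since the MLE itself depends on $\mathcal{D}$.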

For the last two terms, if the data are IID and the number of data points $T$ in $\mathcal{D}$ is sufficiently large, the same approximation as in the derivation of the BIC can be made:

$$ \frac{k_{\mathcal{M}}}{2} \log (2 \pi) - \frac{1}{2} \log (|-H(\hat{\theta})|) \approx -\frac{k_{\mathcal{M}}}{2} \log (T) $$
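For completeness, the step behind this approximation is the standard BIC argument: for $T$ IID observations the observed information grows linearly with $T$, i.e. $-H(\hat{\theta}) \approx T\, J(\hat{\theta})$ where $J(\hat{\theta})$ is the average per-observation observed information, so that

\begin{equation} \frac{1}{2} \log (|-H(\hat{\theta})|) \approx \frac{1}{2} \log \big( T^{k_{\mathcal{M}}} |J(\hat{\theta})| \big) = \frac{k_{\mathcal{M}}}{2} \log (T) + \frac{1}{2} \log (|J(\hat{\theta})|) \end{equation}

and the terms that do not grow with $T$, namely $\frac{k_{\mathcal{M}}}{2} \log (2 \pi)$ and $\frac{1}{2} \log (|J(\hat{\theta})|)$, are dropped.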

Since $k_{\mathcal{M}_0} \leq k_{\mathcal{M}_1}$, the inequality therefore holds when the data generated from $\mathcal{M}_0$ and $\mathcal{M}_1$ are IID.

If the data are correlated, the above approximation no longer holds. However, the determinant of the Hessian (a symmetric matrix) can be written as the product of its eigenvalues, which leads to the sufficient condition stated above. This inequality can also be seen as a more general version (using less stringent approximations) of a result presented in the following paper:

Heavens, Alan F., T. D. Kitching, and L. Verde. "On model selection forecasting, dark energy and modified gravity." Monthly Notices of the Royal Astronomical Society 380.3 (2007): 1029-1035.
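To make the eigenvalue step explicit: since $-H(\hat{\theta})$ is symmetric (and positive definite at a maximum of the log-likelihood), its determinant is the product of its eigenvalues, so for each model

\begin{equation} \frac{k_{\mathcal{M}}}{2} \log (2 \pi) - \frac{1}{2} \log (|-H(\hat{\theta})|) = \frac{k_{\mathcal{M}}}{2} \log (2 \pi) - \frac{1}{2} \sum_{i=1}^{k_{\mathcal{M}}} \log (\lambda_i) \end{equation}

Averaging under $p(\mathcal{D}| \theta_0^*,\mathcal{M}_0)$ and requiring the $\mathcal{M}_0$ terms to be at least as large as the $\mathcal{M}_1$ terms gives, up to an overall factor of $2$, the sufficient condition stated in the proposition.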

Camille Gontier