Context: I'm trying to understand BIC on a deeper level. I'm using BIC for model/structure selection for Bayesian networks.
I'm confused because BIC is an approximation to the likelihood of a model, the likelihood should never decrease when the model becomes more complex, and yet BIC contains a term that penalizes more complex models. So either I'm missing an assumption, or I'm missing the feature of the approximation that introduces the complexity penalty. The key fact seems to be how the determinant of the Hessian of $\log [p(D, \boldsymbol{\theta}_s | S^h)]$ grows with $N$.
As a reference, I'm using Heckerman, D. "A tutorial on learning with Bayesian networks." Innovations in Bayesian networks. Springer Berlin Heidelberg, 2008. 33-82.
Here's how Heckerman derives the BIC. First define $g(\boldsymbol{\theta}_s)$ (eq. 3.29, page 53):
\begin{align} g(\boldsymbol{\theta}_s) \equiv \log [p(D|\boldsymbol{\theta}_s, S^h)p(\boldsymbol{\theta}_s|S^h)] \end{align}
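Since $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ is a random sample of $N$ cases, this is a sum of per-case log-likelihoods plus the log prior:
\begin{align} g(\boldsymbol{\theta}_s) = \sum_{l=1}^{N} \log p(\mathbf{x}_l|\boldsymbol{\theta}_s, S^h) + \log p(\boldsymbol{\theta}_s|S^h) \end{align}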
Then take a Taylor expansion (eq. 3.30, page 53):
\begin{align} g(\boldsymbol{\theta}_s) \approx g(\tilde{\boldsymbol{\theta}}_s) - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \end{align}
where $A$ is the negative Hessian of $g(\boldsymbol{\theta}_s)$ evaluated at $\tilde{\boldsymbol{\theta}}_s$, and $\tilde{\boldsymbol{\theta}}_s = \arg \max_{\boldsymbol{\theta}_s} g(\boldsymbol{\theta}_s)$.
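To spell out what the expansion uses: the first-order term vanishes because $\tilde{\boldsymbol{\theta}}_s$ maximizes $g$, and $A$ collects the negated second derivatives of $g$ at that point:
\begin{align} \nabla g(\tilde{\boldsymbol{\theta}}_s) = \mathbf{0}, \qquad A_{ij} = - \left. \frac{\partial^2 g(\boldsymbol{\theta}_s)}{\partial \theta_i \, \partial \theta_j} \right|_{\boldsymbol{\theta}_s = \tilde{\boldsymbol{\theta}}_s} \end{align}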
This lets us see that $p(\boldsymbol{\theta}_s|D, S^h)$ is approximately Gaussian (eq. 3.31, page 53):
\begin{align} \qquad \qquad p(\boldsymbol{\theta}_s|D, S^h) &\propto p(D| \boldsymbol{\theta}_s, S^h) p(\boldsymbol{\theta}_s|S^h) \\ &\approx p(D| \tilde{\boldsymbol{\theta}}_s, S^h) p(\tilde{\boldsymbol{\theta}}_s|S^h) \exp \{ - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \} \end{align}
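In other words, the posterior over the parameters is approximately a multivariate Gaussian with mean $\tilde{\boldsymbol{\theta}}_s$ and covariance $A^{-1}$:
\begin{align} p(\boldsymbol{\theta}_s|D, S^h) \approx \mathcal{N}(\boldsymbol{\theta}_s \mid \tilde{\boldsymbol{\theta}}_s, A^{-1}) \end{align}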
This means that we can approximate the following integral, the marginal likelihood of the structure, in closed form (eq. 3.40, page 59):
\begin{align} p(D|S^h) = \int p(D|\boldsymbol{\theta}_s, S^h) p(\boldsymbol{\theta}_s| S^h) d\boldsymbol{\theta}_s \end{align}
Substituting eq. 3.31 into this integral and taking the log, we get the Laplace approximation to the (log) marginal likelihood (eq. 3.41, page 59):
\begin{align} \log p(D|S^h) \approx \log p(D|\tilde{\boldsymbol{\theta}}_s, S^h) + \log p(\tilde{\boldsymbol{\theta}}_s| S^h) + \frac{d}{2}\log (2\pi) - \frac{1}{2} \log |A| \end{align}
where $d$ is the dimension of $g(\boldsymbol{\theta}_s)$, i.e., the number of free parameters in $\boldsymbol{\theta}_s$.
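To spell out the step from eq. 3.40 to eq. 3.41: substituting the Gaussian approximation above into the integral and using the standard multivariate Gaussian integral,
\begin{align} p(D|S^h) &\approx p(D|\tilde{\boldsymbol{\theta}}_s, S^h) \, p(\tilde{\boldsymbol{\theta}}_s|S^h) \int \exp \{ - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \} d\boldsymbol{\theta}_s \\ &= p(D|\tilde{\boldsymbol{\theta}}_s, S^h) \, p(\tilde{\boldsymbol{\theta}}_s|S^h) \, (2\pi)^{d/2} |A|^{-1/2} \end{align}
and taking logs gives the $\frac{d}{2}\log (2\pi) - \frac{1}{2} \log |A|$ terms.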
To get BIC, Heckerman keeps only the terms that grow with $N$: $\log p(D|\tilde{\boldsymbol{\theta}}_s, S^h)$, which increases linearly with $N$, and $\log |A|$, which increases as $d \log N$. He also substitutes the ML estimate $\widehat{\boldsymbol{\theta}}_s$ for $\tilde{\boldsymbol{\theta}}_s$. This gives the familiar BIC score (eq. 3.42, page 60):
\begin{align} \log p(D|S^h) \approx \log p(D|\widehat{\boldsymbol{\theta}}_s, S^h) - \frac{d}{2} \log N \end{align}
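As a sanity check on how I'm reading eqs. 3.29, 3.41 and 3.42, here is a minimal numerical sketch. It assumes a toy one-parameter Bernoulli model with a Beta prior (my own example, not one from the tutorial; `alpha`, `beta`, and `theta_true` are arbitrary choices), for which the exact marginal likelihood is available in closed form. It prints the exact $\log p(D|S^h)$, the Laplace approximation, the BIC score, and $\log |A| / \log N$, since that ratio is the thing my question hinges on:

```python
import numpy as np
from scipy.special import betaln, xlogy

# Toy check: Bernoulli likelihood with a Beta(alpha, beta) prior (d = 1).
# All values below are arbitrary choices for illustration.
alpha, beta, theta_true = 2.0, 2.0, 0.3
rng = np.random.default_rng(0)

for N in [10, 100, 1000, 10000]:
    x = rng.binomial(1, theta_true, size=N)
    k = x.sum()

    # Exact log marginal likelihood log p(D | S^h) for the Beta-Bernoulli model
    exact = betaln(alpha + k, beta + N - k) - betaln(alpha, beta)

    # MAP estimate theta_tilde and g(theta_tilde)  (eq. 3.29)
    t = (k + alpha - 1) / (N + alpha + beta - 2)
    g = (xlogy(k + alpha - 1, t) + xlogy(N - k + beta - 1, 1 - t)
         - betaln(alpha, beta))

    # A = -g''(theta_tilde): a 1x1 "Hessian", so |A| = A
    A = (k + alpha - 1) / t**2 + (N - k + beta - 1) / (1 - t)**2

    # Laplace approximation (eq. 3.41) and BIC (eq. 3.42, with d = 1)
    laplace = g + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)
    theta_hat = k / N
    bic = xlogy(k, theta_hat) + xlogy(N - k, 1 - theta_hat) - 0.5 * np.log(N)

    print(f"N={N:6d}  exact={exact:10.3f}  laplace={laplace:10.3f}  "
          f"bic={bic:10.3f}  log|A|/logN={np.log(A) / np.log(N):.3f}")
```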
Here's the part I'm confused about: why does $\log |A|$ grow as $d \log N$? I have no intuition for how the determinant of this Hessian should scale with $N$, and Heckerman states this fact without explaining it.
It seems strange to me. The BIC is an approximation to the likelihood, not the posterior, so it shouldn't have a preference for simpler models -- the likelihood should never decrease as the model gets more complex. But the $-\frac{d}{2} \log N$ term introduces a penalty for large dimension $d$, which looks like a prior that favors simpler models.