Context: I'm trying to understand BIC on a deeper level. I'm using BIC for model/structure selection for Bayesian networks.
I'm confused because BIC is an approximation to the likelihood of a model, the likelihood should never decrease when the model becomes more complex, and yet BIC contains a term that penalizes more complex models. So either I'm missing an assumption, or I'm missing the feature of the approximation that introduces the complexity penalty. The key fact seems to be how the determinant of the Hessian of $\log [p(D, \boldsymbol{\theta}_s | S^h)]$ grows with $N$.
As a reference, I'm using Heckerman, D. "A tutorial on learning with Bayesian networks." Innovations in Bayesian networks. Springer Berlin Heidelberg, 2008. 33-82.
Here's how Heckerman derives the BIC. First define $g(\boldsymbol{\theta}_s)$ (eq. 3.29, page 53):
\begin{align} g(\boldsymbol{\theta}_s) \equiv \log [p(D|\boldsymbol{\theta}_s, S^h)p(\boldsymbol{\theta}_s|S^h)] \end{align}
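Since $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ is a random sample of $N$ cases, this is a sum of per-case log-likelihoods plus the log prior:
\begin{align} g(\boldsymbol{\theta}_s) = \sum_{l=1}^{N} \log p(\mathbf{x}_l|\boldsymbol{\theta}_s, S^h) + \log p(\boldsymbol{\theta}_s|S^h) \end{align}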
Then take a Taylor expansion (eq. 3.30, page 53):
\begin{align} g(\boldsymbol{\theta}_s) \approx g(\tilde{\boldsymbol{\theta}}_s) - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \end{align}
where $A$ is the negative Hessian of $g(\boldsymbol{\theta}_s)$ evaluated at $\tilde{\boldsymbol{\theta}}_s$, and $\tilde{\boldsymbol{\theta}}_s = \arg \max_{\boldsymbol{\theta}_s} g(\boldsymbol{\theta}_s)$.
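To spell out what the expansion uses: the first-order term vanishes because $\tilde{\boldsymbol{\theta}}_s$ maximizes $g$, and $A$ collects the negated second derivatives of $g$ at that point:
\begin{align} \nabla g(\tilde{\boldsymbol{\theta}}_s) = \mathbf{0}, \qquad A_{ij} = - \left. \frac{\partial^2 g(\boldsymbol{\theta}_s)}{\partial \theta_i \, \partial \theta_j} \right|_{\boldsymbol{\theta}_s = \tilde{\boldsymbol{\theta}}_s} \end{align}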
This lets us see that $p(\boldsymbol{\theta}_s|D, S^h)$ is approximately Gaussian (eq. 3.31, page 53):
\begin{align} \qquad \qquad p(\boldsymbol{\theta}_s|D, S^h) &\propto p(D| \boldsymbol{\theta}_s, S^h) p(\boldsymbol{\theta}_s|S^h) \\ &\approx p(D| \tilde{\boldsymbol{\theta}}_s, S^h) p(\tilde{\boldsymbol{\theta}}_s|S^h) \exp \{ - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \} \end{align}
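In other words, the posterior over the parameters is approximately a multivariate Gaussian with mean $\tilde{\boldsymbol{\theta}}_s$ and covariance $A^{-1}$:
\begin{align} p(\boldsymbol{\theta}_s|D, S^h) \approx \mathcal{N}(\boldsymbol{\theta}_s \mid \tilde{\boldsymbol{\theta}}_s, A^{-1}) \end{align}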
This means that we can approximate the following integral, the marginal likelihood of the structure, in closed form (eq. 3.40, page 59):
\begin{align} p(D|S^h) = \int p(D|\boldsymbol{\theta}_s, S^h) p(\boldsymbol{\theta}_s| S^h) d\boldsymbol{\theta}_s \end{align}
Substituting eq. 3.31 into this integral and taking the log, we get the Laplace approximation to the (log) marginal likelihood (eq. 3.41, page 59):
\begin{align} \log p(D|S^h) \approx \log p(D|\tilde{\boldsymbol{\theta}}_s, S^h) + \log p(\tilde{\boldsymbol{\theta}}_s| S^h) + \frac{d}{2}\log (2\pi) - \frac{1}{2} \log |A| \end{align}
where $d$ is the dimension of $g(\boldsymbol{\theta}_s)$, i.e., the number of free parameters in $\boldsymbol{\theta}_s$.
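To spell out the step from eq. 3.40 to eq. 3.41: substituting the Gaussian approximation above into the integral and using the standard multivariate Gaussian integral,
\begin{align} p(D|S^h) &\approx p(D|\tilde{\boldsymbol{\theta}}_s, S^h) \, p(\tilde{\boldsymbol{\theta}}_s|S^h) \int \exp \{ - \frac{1}{2}(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)A(\boldsymbol{\theta}_s - \tilde{\boldsymbol{\theta}}_s)^T \} d\boldsymbol{\theta}_s \\ &= p(D|\tilde{\boldsymbol{\theta}}_s, S^h) \, p(\tilde{\boldsymbol{\theta}}_s|S^h) \, (2\pi)^{d/2} |A|^{-1/2} \end{align}
and taking logs gives the $\frac{d}{2}\log (2\pi) - \frac{1}{2} \log |A|$ terms.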
To get BIC, Heckerman keeps only the terms that grow with $N$: $\log p(D|\tilde{\boldsymbol{\theta}}_s, S^h)$, which increases linearly with $N$, and $\log |A|$, which increases as $d \log N$. He also substitutes the ML estimate $\widehat{\boldsymbol{\theta}}_s$ for $\tilde{\boldsymbol{\theta}}_s$. This gives the familiar BIC score (eq. 3.42, page 60):
\begin{align} \log p(D|S^h) \approx \log p(D|\widehat{\boldsymbol{\theta}}_s, S^h) - \frac{d}{2} \log N \end{align}
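As a sanity check on how I'm reading eqs. 3.29, 3.41 and 3.42, here is a minimal numerical sketch. It assumes a toy one-parameter Bernoulli model with a Beta prior (my own example, not one from the tutorial; `alpha`, `beta`, and `theta_true` are arbitrary choices), for which the exact marginal likelihood is available in closed form. It prints the exact $\log p(D|S^h)$, the Laplace approximation, the BIC score, and $\log |A| / \log N$, since that ratio is the thing my question hinges on:

```python
import numpy as np
from scipy.special import betaln, xlogy

# Toy check: Bernoulli likelihood with a Beta(alpha, beta) prior (d = 1).
# All values below are arbitrary choices for illustration.
alpha, beta, theta_true = 2.0, 2.0, 0.3
rng = np.random.default_rng(0)

for N in [10, 100, 1000, 10000]:
    x = rng.binomial(1, theta_true, size=N)
    k = x.sum()

    # Exact log marginal likelihood log p(D | S^h) for the Beta-Bernoulli model
    exact = betaln(alpha + k, beta + N - k) - betaln(alpha, beta)

    # MAP estimate theta_tilde and g(theta_tilde)  (eq. 3.29)
    t = (k + alpha - 1) / (N + alpha + beta - 2)
    g = (xlogy(k + alpha - 1, t) + xlogy(N - k + beta - 1, 1 - t)
         - betaln(alpha, beta))

    # A = -g''(theta_tilde): a 1x1 "Hessian", so |A| = A
    A = (k + alpha - 1) / t**2 + (N - k + beta - 1) / (1 - t)**2

    # Laplace approximation (eq. 3.41) and BIC (eq. 3.42, with d = 1)
    laplace = g + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(A)
    theta_hat = k / N
    bic = xlogy(k, theta_hat) + xlogy(N - k, 1 - theta_hat) - 0.5 * np.log(N)

    print(f"N={N:6d}  exact={exact:10.3f}  laplace={laplace:10.3f}  "
          f"bic={bic:10.3f}  log|A|/logN={np.log(A) / np.log(N):.3f}")
```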
Here's the part I'm confused about: why does $\log |A|$ grow as $d \log N$? I have no intuition for how the determinant of this Hessian should scale with $N$, and Heckerman states this fact without explaining it.
It seems strange to me. The BIC is an approximation to the likelihood, not the posterior, so it shouldn't have a preference for simpler models -- the likelihood should never decrease as the model gets more complex. But the $-\frac{d}{2} \log N$ term introduces a penalty for large dimension $d$, which looks like a prior that favors simpler models.