Obviously, computing the inverse Hessian is hard when a probability distribution is fitted to high-dimensional data points. One idea to reduce computational cost would be to approximate the distribution with a diagonal Gaussian. Are there any standard techniques that allow making efficient use of the Laplace approximation in high dimensions?
2 Answers
Are there any standard techniques that allow making efficient use of the Laplace approximation in high dimensions?
Assuming that you are using a Laplace approximation for a random effects model: often, groups of outcomes are conditionally independent given the random effects. These models have a sparse Hessian, which makes it easy to work with high-dimensional problems. As an example, consider the following mixed logistic regression
$$ \begin{align} y_{ij} &\sim \text{Bin}(\text{logit}^{-1}(b_i + u_i x_{ij}), 1) & j &= 1, \dots, 5 \\ u_i &\sim N(0, \sigma^2_u) & i &= 1, \dots, k \\ b_i &\sim N(0, \sigma^2_b) & i &= 1, \dots, k \end{align} $$
We can simulate from this model with $k = 3$ and show the Hessian as follows
# simulate data
n <- 5
k <- 3
sig <- 1
set.seed(1)
dat <- do.call(rbind, lapply(1:k, function(i){
  u <- rnorm(2, sd = sig)                        # random intercept and slope for group i
  x <- runif(n, -1, 1)                           # covariate
  y <- runif(n) < 1/(1 + exp(-u[1] - u[2] * x))  # Bernoulli draw with success probability logit^-1(u[1] + u[2] * x)
  data.frame(y = y, x = x, grp = i)
}))
# approximate the hessian
library(numDeriv)
# joint log density of the outcomes and the random effects
ll <- function(u){
  i_start <- (dat$grp - 1) * 2
  p_hat <- 1/(1 + exp(-u[i_start + 1] - dat$x * u[i_start + 2]))
  sum(log(ifelse(dat$y > 0, p_hat, 1 - p_hat))) +
    sum(dnorm(u, sd = sig, log = TRUE))
}
he <- hessian(ll, numeric(2 * k))
he[abs(he) < 1e-3] <- 0 # threshold just as an example here
print(he, digits = 2)
#R> [,1] [,2] [,3] [,4] [,5] [,6]
#R> [1,] -2.25 -0.42 0.00 0.00 0.00 0.00
#R> [2,] -0.42 -1.49 0.00 0.00 0.00 0.00
#R> [3,] 0.00 0.00 -2.25 -0.23 0.00 0.00
#R> [4,] 0.00 0.00 -0.23 -1.39 0.00 0.00
#R> [5,] 0.00 0.00 0.00 0.00 -2.25 -0.23
#R> [6,] 0.00 0.00 0.00 0.00 -0.23 -1.32
For this model, the Hessian is a block diagonal matrix with $2 \times 2$ blocks. Thus, it is easy to invert and to compute the determinant of. The number of non-zero entries of the Hessian is only $2^2k$ rather than $(2k)^2$. This structure is the basis of software such as the lme4 package and the TMB package.
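As a minimal sketch (reusing he and k from the code above; the helper blk_idx and the object names are just for illustration), the log-determinant and the inverse can be obtained block by block instead of from the dense $2k \times 2k$ matrix
blk_idx <- function(i) (2 * i - 1):(2 * i)
# log-determinant of -H as a sum of 2x2 log-determinants
blk_logdet <- sum(sapply(1:k, function(i)
  determinant(-he[blk_idx(i), blk_idx(i)], logarithm = TRUE)$modulus))
# inverse of H stored as a list of 2x2 blocks
blk_inv <- lapply(1:k, function(i) solve(he[blk_idx(i), blk_idx(i)]))
# matches the dense computation (he has the small off-block entries zeroed out)
all.equal(blk_logdet, c(determinant(-he, logarithm = TRUE)$modulus))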
One idea to reduce computational cost would be to approximate the distribution with a diagonal Gaussian.
There are of course settings where the Hessian is not sparse. These include some Gaussian processes and autoencoders. Assuming a diagonal covariance matrix is akin to the Gaussian variational approximations used in many machine learning applications. However, I suspect that one might opt for some low-rank approximation of the determinant instead.
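As a rough illustration of that last point, suppose the Hessian is approximated by a diagonal-plus-low-rank matrix $D + VV^\top$. The matrix determinant lemma then gives the log-determinant from an $r \times r$ determinant at $O(pr^2)$ cost instead of $O(p^3)$. The dimension, rank, and factors below are made up for the example
set.seed(1)
p <- 1000                          # dimension (made up for illustration)
r <- 5                             # rank of the low-rank correction
D <- runif(p, 1, 2)                # diagonal part
V <- matrix(rnorm(p * r), p, r)    # low-rank factor
# log|D + V V^T| = sum(log(D)) + log|I_r + V^T D^-1 V|
logdet <- sum(log(D)) +
  c(determinant(diag(r) + crossprod(V / sqrt(D)), logarithm = TRUE)$modulus)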

You may want to look into the SR1 and BFGS (and its dual, DFP) optimization methods. These iteratively update the approximated inverse Hessian using rank-1 (SR1) and rank-2 (BFGS) updates. They are also fairly efficient and can likely handle your problem.
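For concreteness, here is a minimal sketch of the rank-2 BFGS update of an approximate inverse Hessian H, given a step s and the corresponding change in the gradient y (the function name and arguments are just for illustration)
# BFGS rank-2 update of an approximate inverse Hessian H
bfgs_inv_update <- function(H, s, y){
  rho <- 1 / sum(y * s)
  I <- diag(length(s))
  (I - rho * s %*% t(y)) %*% H %*% (I - rho * y %*% t(s)) + rho * s %*% t(s)
}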

- The OP wants to apply a Laplace approximation so he needs to compute the determinant of the Hessian regardless of the optimizer he is using. – Benjamin Christoffersen Jul 24 '20 at 07:32
- Yes, but if computing the inverse Hessian is hard, I've given the OP a few methods to make that easier -- which is the OP's ultimate goal ("idea to reduce computational cost"). – kurtosis Jul 24 '20 at 07:35
- But that will not help the OP, as the OP needs to evaluate the determinant? Using SR1 and BFGS approximations would still yield a dense Hessian approximation of which the OP needs to compute the determinant in order to apply the Laplace approximation. This is computationally hard for larger problems, which I guess is the reason that the OP asks for a diagonal approximation instead. – Benjamin Christoffersen Sep 03 '20 at 14:20