
Consider a classic machine learning problem that we want to solve with a neural network (NN), and suppose that we want to use Bayesian learning for it.

In the Bayesian approach the posterior is described as follows: $$P(W|X) = \frac{P(X|W)P(W)}{P(X)}$$

where $W$ represents the weights of our network. In the Bayesian setup, each weight is a random variable with a distribution, rather than a single number.

Because $P(X)$ is intractable, the full posterior $P(W|X)$ is hard to compute, and approximate inference methods (e.g. variational inference or MCMC) are needed.
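
Concretely, the intractable denominator is the marginal likelihood (evidence), an integral over every possible weight configuration: $$P(X) = \int P(X|W)\,P(W)\,dW$$ For a network with a nonlinear likelihood and a high-dimensional $W$, this integral has no closed form and cannot be evaluated numerically.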

In order to get a prediction, we now consider two alternatives:

  1. Assuming we have computed the posterior, we sample weights from it and average the resulting predictions (or compute the full integral, if that were possible).

  2. Instead of computing the full posterior (which is hard), compute only the MAP estimate (i.e. the mode of the posterior, the single most probable weight configuration). Prediction then becomes very easy, because we use just one weight vector each time. It is also known that computing the MAP is not a hard task and is equivalent to maximizing the likelihood with regularization (both alternatives are written out just after this list).
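
Written out, alternative 1 targets the posterior predictive distribution for a new input $x^*$, usually approximated by Monte Carlo over posterior samples: $$P(y^*|x^*, X) = \int P(y^*|x^*, W)\,P(W|X)\,dW \approx \frac{1}{S}\sum_{s=1}^{S} P(y^*|x^*, W_s), \quad W_s \sim P(W|X)$$ while alternative 2 plugs a single point estimate into the network: $$\hat{W}_{\text{MAP}} = \arg\max_W P(W|X) = \arg\max_W \left[\log P(X|W) + \log P(W)\right]$$ The second form also shows why MAP is equivalent to maximizing the likelihood with regularization: the log-prior acts as the regularizer (an $L_2$ penalty when the prior is Gaussian).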

Let's also assume that the posterior is modeled by a normal distribution.

I understand why the full posterior is different from the MAP and why we might want to compute it, e.g. it also gives us the uncertainty in addition to the mean prediction.
But my question is:

In the specific setup where the weights are modeled as a normal distribution, what is the mathematical difference between using only the mode and integrating over all weights?

That is, why will computing the integral over all the weights not produce the same result as using the most probable weight (assuming a normal distribution)?

ofer-a
    The purpose of a Bayesian analysis is not to produce a point estimator. And [in my opinion](https://xianblog.wordpress.com/2016/11/30/map-as-bayes-estimators/) MAP estimators are not Bayesian estimators. – Xi'an Sep 11 '21 at 15:27
  • @Xi'an only partially. I fully understand that they are not the same, and knowing the full posterior also lets you answer more questions, such as quantifying uncertainty and other use cases. But I am interested in a specific setup in which I use an NN and the weights are modeled as a normal distribution. In this setup, will averaging over samples from the normal distribution differ from using the mode? – ofer-a Sep 11 '21 at 16:02

2 Answers


Michael Betancourt writes a little about this in this paper. Here is an excerpt:

> This assumption implies that the variation in the integrand is dominated by the target density, and hence we should consider the neighborhood around the mode where the density is maximized. This intuition is consistent with the many statistical methods that utilize the mode, such as maximum likelihood estimators and Laplace approximations, although conflicts with our desire to avoid the specific details of the target density. Indeed, this intuition is fatally naive as it misses a critical detail.
>
> Expectation values are given by accumulating the integrand over a volume of parameter space and, while the density is largest around the mode, there is not much volume there. To identify the regions of parameter space that dominate expectations we need to consider the behavior of both the density and the volume. In high-dimensional spaces the volume behaves very differently from the density, resulting in a tension that concentrates the significant regions of parameter space away from either extreme.

In that paper, Betancourt motivates expectations with respect to a target distribution as the real quantities of interest in Bayesian statistics. Because expectations are functions not only of the probability density but also of the volume, focusing solely on the mode of a target distribution means focusing on a region with high density but very little volume. Betancourt goes on to say that we need to focus on the regions of parameter space where the product of density and volume is high; this region is referred to as the typical set.
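
A quick simulation makes the density-versus-volume tension concrete. Below is a minimal sketch (assuming only NumPy, with a standard $d$-dimensional Gaussian target, whose mode is the origin): as $d$ grows, samples concentrate in a thin shell at distance about $\sqrt{d}$ from the mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# For a standard d-dimensional Gaussian the density peaks at the origin,
# yet almost all of the probability mass lives in a thin shell of radius
# ~sqrt(d): the "typical set" moves away from the mode as d grows.
for d in [1, 10, 100, 1000]:
    samples = rng.standard_normal((10_000, d))
    radii = np.linalg.norm(samples, axis=1)
    print(f"d={d:5d}  mean distance from mode: {radii.mean():8.2f}"
          f"  (sqrt(d) = {np.sqrt(d):.2f})")
```

At $d = 1000$ the average distance from the mode is about 31.6 with little spread, so an expectation estimated by sitting at the mode uses a region that contributes almost nothing to the integral.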

Additionally, I wrote a paper in 2020 comparing HMC (which is slow to sample for large problems) with MAP (which is faster). Though MAP and HMC have similar point predictions, the normal approximation used by MAP greatly overstates uncertainty. In applications where we need to make decisions about what to do, this uncertainty can greatly affect our decision quality, as we demonstrate via simulation.

Demetri Pananos
  • Thanks, is it also true if we model our NN weights as a normal distribution? Can you give an example of why the volume matters in this case? – ofer-a Sep 11 '21 at 14:49
  • @ofer-a I've not had the time or interest to study fully Bayesian neural nets, but I imagine the principles would still apply. – Demetri Pananos Sep 11 '21 at 17:18

The difference is that with the full posterior you estimate the entire distribution of the parameters, while with MAP you find only the mode of that distribution. You cannot really compare them; it is like asking how the ages of all the people differ from the average age of the whole population. They obviously do.

Another difference is practical: you use different algorithms in the two cases, so it can happen that if you estimate the full posterior and then take its mode, the result differs from that of an optimization algorithm that only looked for the mode. This is especially the case because, for the full posterior, in most cases you would either draw MCMC samples from it and approximate the mode using the mode of the samples, or use some other way of approximating the posterior and find the mode of the approximation.
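
There is also a concrete way to see the difference in the predictions, which is what the question is really after: even when the posterior over a weight is exactly normal (so its mode equals its mean), a nonlinear network gives $E[f(W)] \neq f(E[W])$ in general. Here is a minimal sketch, assuming a hypothetical one-weight "network" made of a single ReLU unit with a $N(0,1)$ posterior over that weight:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(w, x):
    # A hypothetical one-weight "network": a single ReLU unit.
    return np.maximum(0.0, w * x)

# Posterior over the weight: N(0, 1), so mode = mean = 0.
w_samples = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
x = 1.0

map_pred = predict(0.0, x)               # plug in the posterior mode
avg_pred = predict(w_samples, x).mean()  # average over posterior samples

print(f"MAP (mode plug-in) prediction: {map_pred:.4f}")  # 0.0000
print(f"Posterior-averaged prediction: {avg_pred:.4f}")  # ~0.3989 = 1/sqrt(2*pi)
```

Averaging predictions over posterior samples gives about $1/\sqrt{2\pi} \approx 0.40$, while plugging in the mode gives $0$. Only for a linear network do the two coincide; any nonlinearity makes integrating over the weights differ from using the most probable weight, even with a perfectly normal posterior.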

Tim
  • Assuming the mode is the same, and assuming a normal distribution, why, mathematically, will the final answer be different? In one case you sample from a normal distribution and average the samples, and in the other case you use the mean of the normal distribution. – ofer-a Sep 11 '21 at 14:18
  • @ofer-a finding the mode of the distribution and finding the distribution and then taking its mode are the same thing. The only difference is that in practice you would use different algorithms in the two cases, so there's no guarantee the results would be exactly the same. – Tim Sep 11 '21 at 14:30