Consider a classic machine learning problem that we want to solve with a neural network (NN), and suppose we want to use Bayesian learning for it.
In the Bayesian approach the posterior is described as follows: $$P(W|X) = \frac{P(X|W)P(W)}{P(X)}$$
where $W$ represents the weights of our network. In the Bayesian setup, each weight has a distribution rather than a single fixed value.
Because $P(X)$ is intractable, it is hard to compute the full posterior $P(W|X)$, and approximate inference methods (e.g. variational inference or MCMC) are needed.
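(To spell out where the intractability comes from: the denominator is the marginal likelihood, an integral over all weight configurations,
$$P(X) = \int P(X|W)\,P(W)\,dW,$$
which has no closed form for a network with many weights and a nonlinear likelihood.)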
In order to get a prediction, we now consider two alternatives:
1. Assuming we have computed the posterior, sample weights from it and average the resulting predictions (or compute the full integral over the weights, written out below, if that were tractable).
2. Instead of computing the full posterior (which is hard), compute only the MAP estimate (i.e. the most probable weights under the posterior). Prediction then becomes very easy, because we use a single weight vector. Computing the MAP is also not hard: it is equivalent to maximizing the likelihood with a regularization term coming from the prior.
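To make the two alternatives concrete, here they are written out (using $x$ for a new input and $y$ for its prediction, symbols not introduced above). The full posterior predictive is
$$P(y \mid x, X) = \int P(y \mid x, W)\, P(W \mid X)\, dW \;\approx\; \frac{1}{S}\sum_{s=1}^{S} P(y \mid x, W_s), \qquad W_s \sim P(W \mid X),$$
versus the MAP plug-in
$$W_{\text{MAP}} = \arg\max_W \big[\log P(X \mid W) + \log P(W)\big], \qquad P(y \mid x, X) \approx P(y \mid x, W_{\text{MAP}}),$$
where for a Gaussian prior $P(W) = \mathcal{N}(0, \sigma^2 I)$ the term $\log P(W)$ reduces (up to a constant) to the $L_2$ penalty $-\frac{1}{2\sigma^2}\lVert W\rVert^2$, which is the "likelihood with regularization" view mentioned above.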
Let's also assume that the posterior is modeled by a normal distribution.
I understand why the full posterior is different from the MAP and why we might want to compute it, e.g. it also gives us uncertainty in addition to the mean prediction.
But my question is:
In the specific setup where the posterior over the weights is a normal distribution, what is the mathematical difference between using only the mode and integrating over all the weights?
i.e. why does computing the integral over all the weights not produce the same result as plugging in the most probable weight (assuming a normal posterior)?
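Here is a minimal numerical sketch of what I mean, assuming a toy one-weight "network" $f(x; w) = \tanh(wx)$ and a made-up Gaussian posterior over $w$ (both purely illustrative). The two alternatives give different predictions because the network is nonlinear, so $\mathbb{E}[f(x; W)] \neq f(x; \mathbb{E}[W])$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": a single weight followed by a nonlinearity (illustrative only).
def f(x, w):
    return np.tanh(w * x)

# Hypothetical Gaussian posterior over the single weight w.
w_mode = 1.0   # mode (= mean) of the Gaussian posterior
w_std = 0.8    # posterior standard deviation

x = 2.0  # a new input

# Alternative 1: integrate over all weights (Monte Carlo approximation of the integral).
w_samples = rng.normal(w_mode, w_std, size=100_000)
pred_integrated = f(x, w_samples).mean()

# Alternative 2: plug in only the most probable weight (the MAP / mode).
pred_map = f(x, w_mode)

# The two disagree: posterior mass on small and negative weights pulls the
# averaged prediction toward 0, while the MAP prediction ignores that mass.
print("integrated prediction:", pred_integrated)
print("MAP prediction:       ", pred_map)
```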