
One way to measure epistemic uncertainty is Bayesian inference in neural networks. The idea is to learn the posterior over the weights, $P(\phi|X)$, which describes the probability distribution over the weights, then sample from this distribution, run the NN with each sample, and compute the mean and variance of the NN outputs across the samples. The variance should represent the uncertainty. Assume we use variational inference to approximate the posterior (a minimal sketch of this sampling procedure is shown after the two questions below). I have two questions regarding that:

  1. I would expect the uncertainty to depend on the specific input, i.e. for parts of the feature space where we lack data, the network will output high uncertainty, and for other parts of the feature space, low uncertainty. If we only learn the posterior over the weights, will the network be able to differentiate between different inputs? In some cases where the weights are connected directly to the input, such as categorical feature embeddings, I guess it would work fine, but would it work in the general case?
  2. One way to solve this is to learn the posterior over the outputs of the network layers instead of the posterior over the weights. That way the input is also included. But in this case my question is: does the mathematical basis of Bayes' theorem and variational inference still hold? Should something change in the theory/notation if we decide to learn the posterior over the network's output layer?
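For concreteness, here is a minimal sketch of the procedure I described above, in plain NumPy, with a hypothetical one-hidden-layer network and made-up mean-field Gaussian parameters standing in for a learned weight posterior:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean-field Gaussian approximation to the weight posterior:
# each weight has a mean and a standard deviation (values made up here).
w_mean, w_std = rng.normal(size=(1, 8)), 0.1 * np.ones((1, 8))
v_mean, v_std = rng.normal(size=(8, 1)), 0.1 * np.ones((8, 1))

def forward(x, w, v):
    """Tiny one-hidden-layer network: x -> tanh(x @ w) @ v."""
    return np.tanh(x @ w) @ v

def predict_with_uncertainty(x, n_samples=200):
    """Sample weights from the approximate posterior, run the network,
    and return the mean and variance of the outputs over the samples."""
    outputs = []
    for _ in range(n_samples):
        w = w_mean + w_std * rng.standard_normal(w_mean.shape)
        v = v_mean + v_std * rng.standard_normal(v_mean.shape)
        outputs.append(forward(x, w, v))
    outputs = np.stack(outputs)              # (n_samples, n_points, 1)
    return outputs.mean(axis=0), outputs.var(axis=0)

x = np.array([[0.5], [3.0]])                 # two test inputs
mean, var = predict_with_uncertainty(x)
print(mean.ravel(), var.ravel())
```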
ofer-a

1 Answer


You are confusing the posterior with the posterior predictive distribution. The posterior distribution is the distribution of the parameter or parameters:

$$ \underbrace{p(\theta | X)}_{\text{posterior}} \propto \underbrace{p(X|\theta)}_{\text{likelihood}} \; \underbrace{p(\theta)}_{\text{prior}} $$

So its variance tells you about the uncertainty of the parameters, given what you've learned from the data and the prior.
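As a toy sketch of this, assuming a conjugate Beta-Bernoulli coin-flip model (made up for illustration), the posterior over the bias parameter tightens as data accumulates, and its variance is exactly this parameter uncertainty:

```python
from scipy.stats import beta

# Beta(1, 1) prior on a coin's bias theta; observe 7 heads in 10 flips.
heads, tails = 7, 3
posterior = beta(1 + heads, 1 + tails)    # conjugate update: Beta(8, 4)

# The posterior variance quantifies uncertainty about the parameter itself.
print(posterior.mean(), posterior.var())  # ~0.667, ~0.017

# With more data the posterior (hence parameter uncertainty) tightens.
print(beta(1 + 70, 1 + 30).var())         # ~0.002
```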

The posterior predictive distribution tells you about the uncertainty of the value $\tilde x$ predicted by the model:

$$ p(\tilde x | X) = \int_\Theta p(\tilde x | X, \theta) \;\underbrace{p(\theta | X)}_{\text{posterior}} \; d\theta $$

We run posterior predictive checks to see how well the posterior distribution of the predictions reflects the empirical distribution of the data. You can think of it (loosely) in terms similar to the difference between confidence intervals and prediction intervals.

So, yes, the posterior is only about the uncertainty of the weights, and yes, you can use the posterior predictive to look at the uncertainty of the predictions.
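A minimal Monte Carlo sketch of the predictive integral above, assuming a toy model with a Gaussian likelihood and made-up posterior draws for $\theta$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Suppose we already have draws from the posterior p(theta | X),
# e.g. the slope of a linear model (values made up for illustration).
theta_samples = rng.normal(loc=2.0, scale=0.3, size=5000)
sigma = 0.5                                  # assumed observation noise

# Monte Carlo approximation of the posterior predictive p(x_new | X):
# for each posterior draw, sample a prediction from the likelihood.
x_new = 1.5
predictive_samples = rng.normal(loc=theta_samples * x_new, scale=sigma)

# Predictive variance combines parameter and observation uncertainty:
# Var ≈ x_new**2 * Var(theta) + sigma**2
print(predictive_samples.mean(), predictive_samples.var())
```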

Tim
  • I know about the posterior predictive distribution, but I wasn't sure how it can produce different variances for different inputs. For categorical values, which have their own weights, it is clear. But take, for example, a numerical feature and a single weight. I would expect that for values of this feature that were common in the dataset the variance will be low, and for rare values the variance will be high. But when I sample the same weight for different values, shouldn't I expect similar variance (which depends on the weight variance) for the different values? – ofer-a Oct 01 '21 at 11:06
  • @ofer-a if you build a model where the weight itself is a function of the data, then yes. But if you only have it as a parameter in the model, then no. Your model is a function of data and parameters, $f(X, \theta)$; the parameters a posteriori depend on the training data, but the result you get from $f(X, \theta)$ depends on the already estimated ("fixed") parameters and the new data. – Tim Oct 01 '21 at 11:24
  • Example: you have a trivial $E[y] = \beta x$ model. To make a prediction, you just multiply $x$ by the parameter $\beta$; the parameter does not magically change depending on the data it is multiplied by. The result is the product of the random variable $\beta$ and $x$, so it is a random variable itself. How much the parameter varies depends on your training data and the prior (Bayes' theorem), but nothing else. – Tim Oct 01 '21 at 11:32
  • Exactly. In your last example, we learn the posterior for $\beta$, and then to make a prediction we sample from $\beta$ and multiply by $x$, and then we can calculate the mean and the variance. But the variance will not depend on how well the model knows $x$, i.e. how confident the model is about the different values of $x$, and this is what we aim for when we want to measure uncertainty. – ofer-a Oct 01 '21 at 12:00
  • Let's take for example a model trying to predict the height of babies based on the mother's height. I would expect that for values of the mother's height that are common in the dataset, the variance of the prediction will be lower. Maybe the solution is that the model needs many parameters in order to be expressive enough for the different cases. – ofer-a Oct 01 '21 at 12:05
  • @ofer-a you seem to have answered it yourself: either you have a model that behaves differently for different scenarios, and hence has different uncertainties for them, or you have a model that treats them all the same. A Gaussian process is an example of a model where the uncertainties depend on how much data you have (see the sketch after these comments). – Tim Oct 01 '21 at 12:19
  • Thanks. Makes sense. – ofer-a Oct 01 '21 at 12:25
  • @ofer-a moreover, if $\beta$ and $x$ are random variables, then $\beta x$ is a function of two random variables, so the results you would obtain from $\beta x$ for values of $x$ that have low probability would also have a smaller probability of being observed. But this follows from the variability of the input data, not the model; for the model, the variability follows only from the variability of the parameters. – Tim Oct 01 '21 at 12:33
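A small sketch of the contrast discussed in these comments, using made-up data: the $E[y] = \beta x$ model's predictive spread at a new $x$ depends on $x$ only through $|x|$, regardless of how common that $x$ was in training, while a Gaussian process (here scikit-learn's GaussianProcessRegressor) reports larger uncertainty away from the training inputs.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(2)

# Training data concentrated around x = 1 (made up for illustration).
X_train = rng.normal(loc=1.0, scale=0.2, size=(30, 1))
y_train = 2.0 * X_train.ravel() + rng.normal(scale=0.1, size=30)

# 1) E[y] = beta * x with a posterior over beta: the predictive spread at a
#    new x is |x| * sd(beta) -- it does not "know" whether x was common.
beta_samples = rng.normal(loc=2.0, scale=0.05, size=2000)
for x in (1.0, 5.0):
    print("beta*x model, x =", x, "pred std ≈", (beta_samples * x).std())

# 2) A Gaussian process: predictive std grows away from the training inputs.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.01)
gp.fit(X_train, y_train)
_, std = gp.predict(np.array([[1.0], [5.0]]), return_std=True)
print("GP pred std at x=1 and x=5:", std)
```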