
I understand what a posterior distribution is, but I'm not sure what the latter means.

How are the two different?

Kevin P. Murphy writes in his textbook, Machine Learning: A Probabilistic Perspective, that it is "an internal belief state". What does that really mean? I was under the impression that a prior represents your internal belief or bias; where am I going wrong?

A.D

3 Answers


The simple difference between the two is that the posterior distribution is a distribution over the unknown parameter $\theta$ (it is a function of $\theta$), i.e., the posterior distribution is: $$p(\theta|x)=c\times p(x|\theta)p(\theta)$$ where $c$ is the normalizing constant.

While on the other hand, the posterior predictive distribution does not involve the parameter $\theta$ at all, because it has been integrated out, i.e., the posterior predictive distribution is: $$p(x^*|x)=\int_\Theta p(x^*,\theta|x)\,d\theta=\int_\Theta p(x^*|\theta)\,p(\theta|x)\,d\theta$$

where $x^*$ is a new, unobserved random variable that is conditionally independent of $x$ given $\theta$.
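To make the integral concrete, here is a small numerical sketch. The Bernoulli likelihood and the Beta(3, 2) posterior below are illustrative choices, not something taken from the question:

```python
import math

# Illustrative numbers only: a Bernoulli likelihood with a Beta(3, 2)
# posterior over theta.
a, b = 3, 2

def beta_pdf(theta, a, b):
    """Density of a Beta(a, b) distribution at theta."""
    const = math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
    return const * theta ** (a - 1) * (1 - theta) ** (b - 1)

# Posterior predictive P(x* = 1 | x) = integral over theta of
# P(x* = 1 | theta) * p(theta | x) = integral of theta * p(theta | x).
# Approximate it on a grid (midpoint rule).
n = 10_000
pred = sum(theta * beta_pdf(theta, a, b)
           for theta in ((i + 0.5) / n for i in range(n))) / n

print(round(pred, 4))  # matches the closed form a / (a + b) = 0.6
```

For a Beta(a, b) posterior this particular integral is just the posterior mean of $\theta$, which is why the grid estimate lands on $a/(a+b)$.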

I won't dwell on the posterior distribution explanation since you say you understand it, but the posterior distribution "is the distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained" (Wikipedia). So basically it's the distribution that describes your unknown, random parameter.

On the other hand, the posterior predictive distribution has a completely different meaning in that it is the distribution for future predicted data based on the data you have already seen. So the posterior predictive distribution is basically used to predict new data values.

If it helps, here is an example graph of a posterior distribution and a posterior predictive distribution:

[Figure: example plot of a posterior distribution]

[Figure: example plot of a posterior predictive distribution]

Jinhua Wang
    That posterior predictive distribution graph needs new axis labels and a caption or something. I get the idea because I know what a posterior predictive distribution is, but someone who's just figuring it out could get seriously confused. – Cyan Sep 25 '13 at 20:24
  • Thanks @BabakP, could you also point me to what distribution you used to plot the pmf of theta, and P(x*|theta)? – A.D Sep 26 '13 at 03:42
  • ...cause I would like to work out the full example. – A.D Sep 26 '13 at 03:48
  • I just pretended that my posterior was a Beta(3,2). I did not actually work out anything. But of course, if you want an example, assume the likelihood is a Binomial(n,p) and the prior on p is a Beta(a,b) then you should be able to obtain that the posterior is once again a beta distribution. –  Sep 26 '13 at 05:03
  • As well, that posterior predictive is not an easy one to derive. I just grabbed a graph from some Gaussian Process code I wrote for a GP posterior predictive. And with that said, the posterior predictive plot above does not actually correspond to the posterior shown; they are both arbitrary. –  Sep 26 '13 at 05:04
  • Oh ok, thanks that is good to know. I suppose it would be a little easier in the discrete case? Then the posterior predictive would just be a weighted sum of possible outputs with the posterior. Is that correct? – A.D Sep 26 '13 at 14:57
  • There is a wording problem. The posterior distribution does not depend on the parameter. It is a function of the parameter, but one does not need to know the true value of the parameter. The distinction is whether you want a distribution of $\theta$ or a distribution of data. – Frank Harrell Jun 24 '18 at 13:39
  • Why does the posterior predictive distribution involve an integral? – MJimitater May 27 '20 at 08:05
  • Can anybody please confirm that the normalization constant, $c$, is needed in the second formula? – Ivan Aug 05 '20 at 14:54
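The Beta-Binomial setup sketched in the comments (a Binomial(n, p) likelihood with a Beta(a, b) prior on p) does give a closed-form posterior predictive, sometimes called the beta-binomial distribution. A rough illustration, with made-up numbers:

```python
import math

# Illustrative Beta-Binomial sketch: Binomial(n, p) likelihood, Beta(a, b)
# prior on p.  After k successes in n trials the posterior is
# Beta(a + k, b + n - k), and the posterior predictive over k* successes
# in m future trials is the beta-binomial pmf below.

def beta_fn(a, b):
    """The Beta function B(a, b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_binomial_pmf(k_star, m, a, b):
    """P(k* successes in m new trials | Beta(a, b) posterior on p)."""
    return math.comb(m, k_star) * beta_fn(a + k_star, b + m - k_star) / beta_fn(a, b)

# Posterior after k = 2 successes in n = 3 trials under a Beta(1, 1) prior:
a_post, b_post = 1 + 2, 1 + 3 - 2          # Beta(3, 2)

m = 5                                      # number of future trials
pmf = [beta_binomial_pmf(k, m, a_post, b_post) for k in range(m + 1)]
print(round(sum(pmf), 6))                  # a valid pmf: sums to 1.0
```

This is also the "weighted average" view raised in the comments: each Binomial(m, p) pmf is weighted by the Beta(3, 2) posterior over p, with p integrated out.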

They refer to distributions of two different things.

The posterior distribution refers to the distribution of the parameter, while the posterior predictive distribution (PPD) refers to the distribution of future observations of data.

SPMQET
    So, if I understand this correctly, if the true likelihood distribution (or true distribution where the data comes from) is Gaussian, then as we gather more and more data (observations), the PPD should converge towards a Gaussian (for some parameter $\theta$), while the posterior distribution should converge towards a spike at the true parameter $\theta$? – SimpleProgrammer Oct 29 '20 at 20:03

The predictive distribution is usually used when you have learned a posterior distribution over the parameters of some sort of predictive model. For example, in Bayesian linear regression, you learn a posterior distribution over the parameter $w$ of the model $y = wX$, given some observed data $X$.
Then, when a new unseen data point $x^*$ comes in, you want the distribution over possible predictions $y^*$, given the posterior distribution for $w$ that you just learned. This distribution over possible $y^*$'s, obtained by integrating $w$ out against its posterior, is the posterior predictive distribution.
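A minimal 1-D sketch of that idea, assuming a Gaussian prior on $w$ and a known noise variance (the data and hyperparameters are made up for illustration):

```python
# Minimal 1-D Bayesian linear regression sketch: y = w*x + noise, with a
# Gaussian prior on w and a known noise variance.  The posterior over w is
# Gaussian (conjugate update), and integrating w out gives a Gaussian
# posterior predictive for y* at a new input x*.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
sigma2 = 0.25   # assumed noise variance
tau2 = 10.0     # prior variance of w: w ~ N(0, tau2)

# Posterior over w: Gaussian with this precision and mean.
prec = 1.0 / tau2 + sum(x * x for x in xs) / sigma2
w_var = 1.0 / prec
w_mean = (sum(x * y for x, y in zip(xs, ys)) / sigma2) * w_var

# Posterior predictive at a new point x*: Gaussian, with w integrated out.
x_star = 4.0
y_mean = w_mean * x_star
y_var = sigma2 + x_star ** 2 * w_var   # noise + parameter uncertainty

print(round(y_mean, 3), round(y_var, 3))
```

Note that the predictive variance is larger than the noise variance alone, because it also carries the remaining uncertainty about $w$; that extra spread is exactly what distinguishes the posterior predictive from a plug-in prediction.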

user1893354