
I'm reading some course notes on Bayesian statistics, and one of the slides, titled 'model evidence', says:

$$p(y|m)=\int{p(y,\theta|m)d\theta}=\int{p(y|\theta,m)p(\theta | m)d\theta}$$ "Because we have marginalised over $\theta$ the evidence is also known as the marginal likelihood."

I have two issues here.

  1. I don't understand how $p(y,\theta|m)$ becomes $p(y|\theta,m)p(\theta | m)$. Is this derived from the multiplication rule for dependent events, $P(A,B)=P(A|B)P(B)=P(B|A)P(A)$? If so, I don't see how. And how is this linked to "we have marginalised over $\theta$"?
  2. What does $m$ really stand for? I know what it's supposed to stand for, but I can't quite grasp it. How is it related to the model parameters?
– en1

1 Answer

  1. Yes, it is. As you mentioned, the classical rule is $P(A,B) = P(A|B)P(B)$, but it can also be applied to conditional probabilities like $P(\cdot|C)$ instead of $P(\cdot)$. It then becomes

$$ P(A,B|C) = P(A|B,C)P(B|C) $$

(you just add the conditioning on $C$; otherwise it is the same formula). You can then apply this formula with $A = y$, $B = \theta$, and $C = m$, which gives $p(y,\theta|m) = p(y|\theta,m)\,p(\theta|m)$.
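
As a quick sanity check, here is a minimal sketch (using a made-up discrete joint distribution, nothing specific to the slides) that verifies $P(A,B|C) = P(A|B,C)P(B|C)$ numerically:

```python
import numpy as np

# A made-up joint distribution P(A, B, C) over three binary variables,
# purely for illustration.
rng = np.random.default_rng(0)
p_abc = rng.random((2, 2, 2))
p_abc /= p_abc.sum()  # normalise so the entries form a valid joint distribution

p_c = p_abc.sum(axis=(0, 1))       # P(C)
p_bc = p_abc.sum(axis=0)           # P(B, C)
p_ab_given_c = p_abc / p_c         # P(A, B | C)
p_b_given_c = p_bc / p_c           # P(B | C)
p_a_given_bc = p_abc / p_bc        # P(A | B, C)

# Chain rule conditioned on C: P(A, B | C) = P(A | B, C) * P(B | C)
assert np.allclose(p_ab_given_c, p_a_given_bc * p_b_given_c)
```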

You know from the law of total probability that, if $\{B_n\}$ is a partition of the sample space, we obtain

$$ p(A) = \sum_n p(A,B_n) $$

or, using the first formula:

$$ p(A) = \sum_n p(A|B_n)p(B_n) $$

This easily extends to continuous random variables, by replacing the sum by an integral:

$$ p(A) = \int p(A|B)p(B) dB $$

The action of making $B$ "disappear" from $p(A,B)$ by integrating it over $B$ is called "marginalizing" ($B$ has been marginalized out). Once again, you can apply this formula, with everything conditioned on $C$, to $A = y$, $B = \theta$, and $C = m$: this gives exactly the evidence from your slide, $p(y|m)=\int p(y|\theta,m)\,p(\theta|m)\,d\theta$.
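
To make the marginalisation concrete, here is a minimal sketch of computing the evidence by numerical integration for a toy model (the model, prior, and numbers below are made up for illustration): a single observation $y$ with a Gaussian likelihood and a Gaussian prior on its mean, for which $p(y|m)$ also has a known closed form to compare against.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Toy model m: y | theta ~ N(theta, sigma^2), with prior theta ~ N(mu0, tau^2)
sigma, mu0, tau = 1.0, 0.0, 2.0
y = 1.3  # a single observed data point

def integrand(theta):
    # p(y | theta, m) * p(theta | m)
    return stats.norm.pdf(y, loc=theta, scale=sigma) * stats.norm.pdf(theta, loc=mu0, scale=tau)

# Evidence p(y|m): marginalise theta out by integrating over its whole range
evidence_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for this conjugate model: y | m ~ N(mu0, sigma^2 + tau^2)
evidence_exact = stats.norm.pdf(y, loc=mu0, scale=np.sqrt(sigma**2 + tau**2))

print(evidence_numeric, evidence_exact)  # the two values agree
```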

  2. $m$ is the model. Your data $y$ may have been generated by a certain model $m$, and this model itself has some parameters $\theta$. In this setting, $p(y|\theta,m)$ is the probability of observing data $y$ under model $m$ parametrized by $\theta$, and $p(\theta|m)$ is the prior distribution of the parameters of model $m$.

For example, imagine you are trying to fit some data using either a straight line or a parabola. Your two models are thus $m_2$, in which the data are explained as $y = a_2 x^2 + a_1 x + a_0 + \epsilon$ ($\epsilon$ is just some random noise) with parameters $\theta_2 = [a_2 \ a_1 \ a_0]$, and $m_1$, in which the data are explained as $y = a_1 x + a_0 + \epsilon$ with parameters $\theta_1 = [a_1 \ a_0]$.
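
Sticking with that line-versus-parabola picture, here is a minimal sketch (made-up data, unit Gaussian priors on the coefficients, and a known noise level, all chosen just for illustration) that computes the evidence of each model. For a linear-Gaussian model the marginalisation over $\theta$ can be done in closed form: $y \,|\, m \sim \mathcal{N}(0, XX^T + \sigma^2 I)$ when $\theta \sim \mathcal{N}(0, I)$, where $X$ is the design matrix of the model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Made-up data generated by a straight line plus noise
x = np.linspace(-1, 1, 15)
y = 0.8 * x + 0.2 + rng.normal(scale=0.3, size=x.size)
sigma = 0.3  # noise standard deviation, assumed known here

def evidence(design):
    """p(y|m) for y = design @ theta + noise, with prior theta ~ N(0, I).

    Marginalising theta out of this linear-Gaussian model gives, in closed form,
    y | m ~ N(0, design @ design.T + sigma^2 * I).
    """
    cov = design @ design.T + sigma**2 * np.eye(len(y))
    return stats.multivariate_normal.pdf(y, mean=np.zeros(len(y)), cov=cov)

design_m1 = np.column_stack([x, np.ones_like(x)])        # m1: y = a1*x + a0
design_m2 = np.column_stack([x**2, x, np.ones_like(x)])  # m2: y = a2*x^2 + a1*x + a0

print("p(y|m1) =", evidence(design_m1))
print("p(y|m2) =", evidence(design_m2))
# Since the data were generated by a line, m1 typically ends up with the larger
# evidence: m2 spreads its prior mass over curvatures the data do not need.
```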

For further examples, you can have a look at this paper, where we defined different models of the synapse, each with different parameters: https://www.frontiersin.org/articles/10.3389/fncom.2020.558477/full

You can also have a look at the comments here: Formal proof of Occam's razor for nested models

– Camille Gontier
  • Thanks, this is very helpful. Regarding (1): So if I understand this correctly, writing $p(A|B,C)$ is basically done in order to avoid having a dubious syntax such as $p(A|(B|C))$? – en1 Oct 27 '20 at 13:52
  • 1
    Kind of, although I'm not sure you should think in terms of $p(A|(B|C))$, since this notation is misleading. Rather, without getting into the details of how probability functions are defined, you can just assume that the relation $p(A,B) = p(A|B)p(B)$ still holds if you condition the probabilities on $C$ (i.e. if you replace $p(\cdot)$ by the conditional probability $p(\cdot|C)$). – Camille Gontier Oct 27 '20 at 14:42
  • 1
    Regarding (2): So the likelihood $p(y|\theta ,m_1)$ for your example $m_1$ will be expressed by the pdf of $\epsilon$ and the structure of $m_1$. I wish I could upvote you several times - your answer has helped me a lot. – en1 Oct 29 '20 at 06:48
  • 1
    That's right : $p(y|\theta,m_1)$ is the probability of having $y$ given model $m_1$ with parameters $\theta = [a_1, a_0]$, and is thus a Gaussian pdf of mean $a_1 x + a_0$ – Camille Gontier Oct 29 '20 at 13:21
  • Camille, since you seem to be an expert on the topic, do you think you could also answer my [question on Bayesian model selection](https://stats.stackexchange.com/questions/494414/modern-applications-of-bayesian-model-selection)? – en1 Oct 31 '20 at 12:15
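
To spell out that last exchange in code, here is a tiny sketch (with made-up numbers, not part of the original thread): under $m_1$, the likelihood $p(y|\theta,m_1)$ is just the product of Gaussian densities centred on $a_1 x + a_0$, one per data point.

```python
import numpy as np
from scipy import stats

# Made-up parameters and data, purely for illustration
a1, a0, sigma = 0.8, 0.2, 0.3
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-0.55, 0.25, 0.95])

# p(y | theta = [a1, a0], m1): Gaussian noise around the line a1*x + a0
likelihood = stats.norm.pdf(y, loc=a1 * x + a0, scale=sigma).prod()
print(likelihood)
```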