
I am currently reading Murphy's *Machine Learning: A Probabilistic Perspective*. In Chapter 3 he explains that a batch update of the posterior is equivalent to a sequential update, and I am trying to understand this in the context of his example.

Suppose $D_a$ and $D_b$ are two data sets and $\theta$ is the parameter of our model. We want to compute the posterior $P(\theta \mid D_a, D_b)$. For a sequential update, he states that $$ (1) \ \ \ \ \ \ \ \ P(\theta \mid D_{a}, D_{b}) \propto P(D_b \mid \theta) P(\theta \mid D_a). $$ However, I am slightly confused as to how he arrived at this mathematically. Conceptually, I understand that he is saying the posterior $P(\theta \mid D_a)$ now acts as the prior for the update on the new data $D_b$, and is multiplied by the likelihood $P(D_b \mid \theta)$. Expanding the right-hand side with Bayes' theorem, I have $$ P(D_b \mid \theta) P(\theta \mid D_a) \propto P(D_b \mid \theta) P(D_a \mid \theta) P(\theta), $$

but are we allowed to say $P(D_a \mid \theta) P(D_b \mid \theta) = P(D_a, D_b \mid \theta)$ in order to make the connection in (1)?

Shota
  • Concerning your last question: $P(D_a \mid \theta) P(D_b \mid \theta) = P(D_a, D_b \mid \theta)$ holds if $D_a$ and $D_b$ are conditionally independent given $\theta$ (https://en.wikipedia.org/wiki/Conditional_independence). Whether this holds depends on the model, but it is commonly assumed to be true for most models. Using this and dropping the term $P(D_a, D_b)$ (which does not depend on $\theta$) gives you (1). – peuhp Nov 16 '15 at 09:30
  • I do not know Murphy's book, but reading Christopher Bishop's widely known ML book once made me step back to get a more solid Bayesian inference background first. For this I would recommend having a look at the works of [Gregory](http://bit.ly/1S0Yhnx) (do not mind the reference to *Mathematica*), [Sivia/Skilling](http://bit.ly/1kBNXYP), or, if you really love concise maths, [Koch](http://bit.ly/20Z2uy7), which is a very compact reference kind of book. Bayesian inference is about doing mathematical statistics... – gwr Nov 17 '15 at 11:18
  • In the context of a linear regression, what is $D$? Does it refer to the dependent variable (target) or the independent variables (features)? – wwl Apr 23 '19 at 20:22

2 Answers


Indeed - you can update sequentially or in a batch fashion so long as you assume exchangeability. It's analogous to the iid assumption typically made in frequentist models.

In this case, exchangeability of $D_{a}$ and $D_{b}$ implies (by de Finetti's theorem) that they are conditionally independent given some $\theta$, i.e. $P(D_{a}, D_{b} \, | \, \theta) = P(D_{a} \, | \, \theta) P(D_{b} \, | \, \theta)$, which is exactly what you need to make the connection.

You can see a proof of equivalence between a single $n$-large batch update and $n$ sequential updates in an answer I wrote to a similar question.
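
A quick numerical check of this equivalence, using a conjugate Beta-Bernoulli model (a minimal sketch of my own; the Beta(2, 2) prior and the simulated coin flips are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Beta(a, b) prior on the coin's bias theta.
a, b = 2.0, 2.0

# Two data sets of Bernoulli draws, D_a and D_b.
D_a = rng.integers(0, 2, size=50)
D_b = rng.integers(0, 2, size=30)

# Batch update: condition on D_a and D_b at once.
# Beta-Bernoulli conjugacy: posterior is Beta(a + #successes, b + #failures).
heads = D_a.sum() + D_b.sum()
tails = len(D_a) + len(D_b) - heads
batch_posterior = (a + heads, b + tails)

# Sequential update: the posterior after D_a becomes the prior for D_b.
a1 = a + D_a.sum()
b1 = b + len(D_a) - D_a.sum()
seq_posterior = (a1 + D_b.sum(), b1 + len(D_b) - D_b.sum())

print(batch_posterior, seq_posterior)
assert batch_posterior == seq_posterior  # identical posterior parameters
```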

jtobin

Actually, the general formula for sequential Bayesian updating is $$ P(\theta \mid D_{a}, D_{b}) \propto P(D_b \mid \theta, D_a) P(\theta \mid D_a). \,\,\,(*) $$ However, for most machine learning models, $D_a$ and $D_b$ are conditionally independent given $\theta$, i.e., $$ P(D_a \mid \theta) P(D_b \mid \theta) = P(D_a, D_b \mid \theta). $$ In that case, $P(D_b \mid \theta, D_a)$ in $(*)$ reduces to $P(D_b \mid \theta)$, so $(*)$ becomes $$ (1)\,\,\,\,\,\,P(\theta \mid D_{a}, D_{b}) \propto P(D_b \mid \theta) P(\theta \mid D_a), $$ which is exactly what Murphy's ML book talks about.
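
For completeness, $(*)$ itself is just Bayes' theorem combined with the chain rule (a standard derivation, not specific to Murphy's presentation): $$ P(\theta \mid D_a, D_b) = \frac{P(D_b \mid \theta, D_a) \, P(\theta \mid D_a) \, P(D_a)}{P(D_a, D_b)} \propto P(D_b \mid \theta, D_a) \, P(\theta \mid D_a), $$ since $P(D_a)$ and $P(D_a, D_b)$ do not depend on $\theta$.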

lynnjohn