
I'm having some trouble following the logic of this passage from Chapter 14 of Bayesian Data Analysis (Gelman et al.):

The numerical 'data' in a regression problem includes both $X$ and $y$. Thus, a full Bayesian model includes a distribution for $X$, $p(X|\psi)$, indexed by a parameter vector $\psi$, and thus involves a joint likelihood $p(X,y|\psi,\theta)$, along with a prior distribution, $p(\psi,\theta)$. In the standard regression context, the distribution of $X$ is assumed to provide no information about the conditional distribution of $y$ given $X$; that is, we assume prior independence of parameters $\theta$ determining $p(y|X,\theta)$ and the parameters $\psi$ determining $p(X|\psi)$.

Thus, from a Bayesian perspective, the defining characteristic of a 'regression model' is that it ignores the information supplied by $X$ about $(\psi, \theta)$. How can this be justified? Suppose $\psi$ and $\theta$ are independent in their prior distribution; that is, $p(\theta,\psi) = p(\theta)p(\psi)$. Then the posterior distribution factors,

$p(\psi,\theta|X,y) = p(\psi|X)p(\theta|X,y)$, [...]

When I work this out, I can't obtain the last line. I can only get

$p(\psi,\theta|X,y) = p(\psi|X,y,\theta)p(\theta|X,y)$.

Intuitively the statement makes sense, but I can't prove to myself that it is true.

PeteyCoco

2 Answers


It helps to draw a DAG (or causal diagram) representing the dependencies among the random variables $\psi,\theta,X,y$. It would be $$\psi \;\rightarrow\; X \;\rightarrow\; y \;\leftarrow\; \theta.$$ Referring to this diagram will help with the calculations.
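Spelled out via d-separation (this list is my own reading of the diagram, not part of the quoted passage), the DAG encodes the conditional independencies
$$\psi \perp \theta, \qquad X \perp \theta \mid \psi, \qquad \psi \perp (\theta, y) \mid X,$$
and these are exactly the facts used in the steps below.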

Then we can do the calculation:
$$\begin{align}
p(\psi,\theta \mid X,y) &= \frac{p(X,y\mid \psi,\theta)\, p(\psi)\,p(\theta)}{p(X,y)}\\
&= \frac{p(y\mid X,\psi,\theta)\, p(X\mid\psi)\,p(\psi)\,p(\theta)}{p(y\mid X)\,p(X)} \\
&= \frac{p(X\mid\psi)\,p(\psi)}{p(X)}\cdot \frac{p(y\mid X,\psi,\theta)\,p(\theta)}{p(y\mid X)} \\
&= p(\psi\mid X)\cdot \frac{p(y,X,\psi,\theta)\,p(\theta)/ p(X,\psi,\theta)}{p(y\mid X)} \\
&= p(\psi\mid X)\cdot\frac{p(\theta\mid y,X,\psi)\,p(y,X,\psi)\,p(\theta)}{p(y\mid X)\,p(X,\psi\mid \theta)\,p(\theta)}\\
&= p(\psi\mid X)\cdot\frac{p(\theta\mid y,X,\psi)\,p(y\mid X,\psi)}{p(y\mid X)}\\
&= p(\psi\mid X)\cdot p(\theta \mid X,y),
\end{align}$$
where the final step uses $p(\theta\mid y,X,\psi)\,p(y\mid X,\psi)=p(\theta,y\mid X,\psi)=p(\theta,y\mid X)$, which holds because $\psi$ is independent of $(\theta,y)$ given $X$ in the diagram. This can be compared with the frequentist argument in What is the difference between conditioning on regressors vs. treating them as fixed?.
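If you want a numerical sanity check, here is a minimal sketch (a toy discrete model of my own construction, not from the book) that builds a joint distribution with the assumed structure $p(\psi)\,p(\theta)\,p(X\mid\psi)\,p(y\mid X,\theta)$ and verifies the factorization of the posterior:

```python
# Toy discrete check that p(psi, theta | X, y) = p(psi | X) * p(theta | X, y)
# when the joint factors as p(psi) p(theta) p(X | psi) p(y | X, theta).
import numpy as np

rng = np.random.default_rng(0)
n_psi, n_theta, n_x, n_y = 2, 3, 4, 5

p_psi = rng.dirichlet(np.ones(n_psi))                                 # p(psi)
p_theta = rng.dirichlet(np.ones(n_theta))                             # p(theta)
p_x_given_psi = rng.dirichlet(np.ones(n_x), size=n_psi)               # p(X | psi),      shape (n_psi, n_x)
p_y_given_x_theta = rng.dirichlet(np.ones(n_y), size=(n_x, n_theta))  # p(y | X, theta), shape (n_x, n_theta, n_y)

# Full joint p(psi, theta, X, y), indexed [psi, theta, x, y].
joint = (p_psi[:, None, None, None]
         * p_theta[None, :, None, None]
         * p_x_given_psi[:, None, :, None]
         * p_y_given_x_theta.transpose(1, 0, 2)[None, :, :, :])

x_obs, y_obs = 1, 3  # arbitrary observed values

# Exact posterior p(psi, theta | X = x_obs, y = y_obs).
post = joint[:, :, x_obs, y_obs]
post = post / post.sum()

# p(psi | X = x_obs): condition on X only, marginalizing theta and y.
p_psi_x = joint[:, :, x_obs, :].sum(axis=(1, 2))
p_psi_x = p_psi_x / p_psi_x.sum()

# p(theta | X = x_obs, y = y_obs): marginalize psi out of the posterior.
p_theta_xy = joint[:, :, x_obs, y_obs].sum(axis=0)
p_theta_xy = p_theta_xy / p_theta_xy.sum()

# The posterior equals the outer product of the two conditional factors.
print(np.allclose(post, np.outer(p_psi_x, p_theta_xy)))  # True
```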

kjetil b halvorsen

Once you've conditioned on $X$, there is no further information in $y$ or $\theta$ concerning $\psi$, so $p(\psi|X,\theta,y) = p(\psi|X)$. This is a consequence of the independent priors on $\theta$ and $\psi$.

Concretely, in the full conditional distribution for $\psi$, which is proportional to the full joint distribution, you can factor out $p(y|X,\theta)$ and $p(\theta)$: $$p(\psi|X, \theta,y) \propto p(\psi)\;p(\theta)\;p(X|\psi)\;p(y|X,\theta) \propto p(\psi) \;p(X|\psi)\propto p(\psi|X).$$
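For completeness, plugging this into the factorization you already derived gives the book's statement directly:
$$p(\psi,\theta\mid X,y) = p(\psi\mid X,y,\theta)\,p(\theta\mid X,y) = p(\psi\mid X)\,p(\theta\mid X,y).$$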

HStamper