
I'm having some trouble following the logic of this passage from Chapter 14 of Bayesian Data Analysis (Gelman et al.):

The numerical 'data' in a regression problem includes both $X$ and $y$. Thus, a full Bayesian model includes a distribution for $X$, $p(X|\psi)$, indexed by a parameter vector $\psi$, and thus involves a joint likelihood $p(X,y|\psi,\theta)$, along with a prior distribution, $p(\psi,\theta)$. In the standard regression context, the distribution of $X$ is assumed to provide no information about the conditional distribution of $y$ given $X$; that is, we assume prior independence of parameters $\theta$ determining $p(y|X,\theta)$ and the parameters $\psi$ determining $p(X|\psi)$.

Thus, from a Bayesian perspective, the defining characteristic of a 'regression model' is that it ignores the information supplied by $X$ about $(\psi, \theta)$. How can this be justified? Suppose $\psi$ and $\theta$ are independent in their prior distribution; that is, $p(\theta,\psi) = p(\theta)p(\psi)$. Then the posterior distribution factors,

$p(\psi,\theta|X,y) = p(\psi|X)p(\theta|X,y)$, [...]

When I work this out, I can't obtain the last line. I can only get

$p(\psi,\theta|X,y) = p(\psi|X,y,\theta)p(\theta|X,y)$.

Intuitively the statement makes sense, but I can't prove to myself that it is true.

PeteyCoco

2 Answers


It helps to draw a DAG (or causal diagram) representing the dependencies among the random variables $\psi,\theta,X,y$. It would be $$\psi \;\rightarrow\; X \;\rightarrow\; y \;\leftarrow\; \theta.$$ Referring to this diagram will help with the calculations.
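Spelled out via d-separation (this list is my own reading of the diagram, not part of the quoted passage), the DAG encodes the conditional independencies
$$\psi \perp \theta, \qquad X \perp \theta \mid \psi, \qquad \psi \perp (\theta, y) \mid X,$$
and these are exactly the facts used in the steps below.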

Then we can do the calculation:
$$\begin{align}
p(\psi,\theta \mid X,y) &= \frac{p(X,y\mid \psi,\theta)\, p(\psi)\,p(\theta)}{p(X,y)}\\
&= \frac{p(y\mid X,\psi,\theta)\, p(X\mid\psi)\,p(\psi)\,p(\theta)}{p(y\mid X)\,p(X)} \\
&= \frac{p(X\mid\psi)\,p(\psi)}{p(X)}\cdot \frac{p(y\mid X,\psi,\theta)\,p(\theta)}{p(y\mid X)} \\
&= p(\psi\mid X)\cdot \frac{p(y,X,\psi,\theta)\,p(\theta)/ p(X,\psi,\theta)}{p(y\mid X)} \\
&= p(\psi\mid X)\cdot\frac{p(\theta\mid y,X,\psi)\,p(y,X,\psi)\,p(\theta)}{p(y\mid X)\,p(X,\psi\mid \theta)\,p(\theta)}\\
&= p(\psi\mid X)\cdot\frac{p(\theta\mid y,X,\psi)\,p(y\mid X,\psi)}{p(y\mid X)}\\
&= p(\psi\mid X)\cdot p(\theta \mid X,y),
\end{align}$$
where the final step uses $p(\theta\mid y,X,\psi)\,p(y\mid X,\psi)=p(\theta,y\mid X,\psi)=p(\theta,y\mid X)$, which holds because $\psi$ is independent of $(\theta,y)$ given $X$ in the diagram. This can be compared with the frequentist argument in What is the difference between conditioning on regressors vs. treating them as fixed?.
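If you want a numerical sanity check, here is a minimal sketch (a toy discrete model of my own construction, not from the book) that builds a joint distribution with the assumed structure $p(\psi)\,p(\theta)\,p(X\mid\psi)\,p(y\mid X,\theta)$ and verifies the factorization of the posterior:

```python
# Toy discrete check that p(psi, theta | X, y) = p(psi | X) * p(theta | X, y)
# when the joint factors as p(psi) p(theta) p(X | psi) p(y | X, theta).
import numpy as np

rng = np.random.default_rng(0)
n_psi, n_theta, n_x, n_y = 2, 3, 4, 5

p_psi = rng.dirichlet(np.ones(n_psi))                                 # p(psi)
p_theta = rng.dirichlet(np.ones(n_theta))                             # p(theta)
p_x_given_psi = rng.dirichlet(np.ones(n_x), size=n_psi)               # p(X | psi),      shape (n_psi, n_x)
p_y_given_x_theta = rng.dirichlet(np.ones(n_y), size=(n_x, n_theta))  # p(y | X, theta), shape (n_x, n_theta, n_y)

# Full joint p(psi, theta, X, y), indexed [psi, theta, x, y].
joint = (p_psi[:, None, None, None]
         * p_theta[None, :, None, None]
         * p_x_given_psi[:, None, :, None]
         * p_y_given_x_theta.transpose(1, 0, 2)[None, :, :, :])

x_obs, y_obs = 1, 3  # arbitrary observed values

# Exact posterior p(psi, theta | X = x_obs, y = y_obs).
post = joint[:, :, x_obs, y_obs]
post = post / post.sum()

# p(psi | X = x_obs): condition on X only, marginalizing theta and y.
p_psi_x = joint[:, :, x_obs, :].sum(axis=(1, 2))
p_psi_x = p_psi_x / p_psi_x.sum()

# p(theta | X = x_obs, y = y_obs): marginalize psi out of the posterior.
p_theta_xy = joint[:, :, x_obs, y_obs].sum(axis=0)
p_theta_xy = p_theta_xy / p_theta_xy.sum()

# The posterior equals the outer product of the two conditional factors.
print(np.allclose(post, np.outer(p_psi_x, p_theta_xy)))  # True
```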

kjetil b halvorsen

Once you've conditioned on $X$, there is no further information in $y$ or $\theta$ concerning $\psi$, so $p(\psi|X,\theta,y) = p(\psi|X)$. This is a consequence of the independent priors on $\theta$ and $\psi$.

Concretely, in the full conditional distribution for $\psi$, which is proportional to the full joint distribution, you can factor out $p(y|X,\theta)$ and $p(\theta)$: $$p(\psi|X, \theta,y) \propto p(\psi)\;p(\theta)\;p(X|\psi)\;p(y|X,\theta) \propto p(\psi) \;p(X|\psi)\propto p(\psi|X).$$
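For completeness, plugging this into the factorization you already derived gives the book's statement directly:
$$p(\psi,\theta\mid X,y) = p(\psi\mid X,y,\theta)\,p(\theta\mid X,y) = p(\psi\mid X)\,p(\theta\mid X,y).$$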

HStamper