
We are told (in Section 9.2.3 of Deisenroth et al., Mathematics for Machine Learning) that we can compute the posterior over a model's parameters $\boldsymbol\theta$ (here in the context of linear regression) given the data $\mathcal{X,Y}$ as

$$p(\boldsymbol\theta\mid\mathcal{X,Y})= \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)p(\boldsymbol\theta)}{p(\mathcal{Y}\mid\mathcal{X})}.$$
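For concreteness, here is a minimal numerical sketch of what this posterior looks like in the conjugate Gaussian setting the book works with (Gaussian prior, Gaussian likelihood); the design matrix, prior parameters, and noise variance below are hypothetical illustrations:

```python
import numpy as np

# Minimal sketch of the posterior p(theta | X, Y) for Bayesian linear
# regression with a Gaussian prior theta ~ N(m0, S0) and Gaussian noise
# y = Phi @ theta + eps, eps ~ N(0, sigma^2 I). All concrete values
# here are hypothetical; only the conjugate-Gaussian form is standard.

rng = np.random.default_rng(0)
N, D = 20, 2                                                # observations, parameter dim
Phi = np.column_stack([np.ones(N), rng.uniform(-1, 1, N)])  # design matrix X
theta_true = np.array([0.5, -1.0])
sigma2 = 0.25
y = Phi @ theta_true + rng.normal(0.0, np.sqrt(sigma2), N)

m0 = np.zeros(D)   # prior mean
S0 = np.eye(D)     # prior covariance

# Posterior is Gaussian: S_N = (S0^-1 + Phi^T Phi / sigma^2)^-1,
# m_N = S_N (S0^-1 m0 + Phi^T y / sigma^2). Note that X enters only
# through the likelihood; the prior p(theta) never conditions on X.
SN = np.linalg.inv(np.linalg.inv(S0) + Phi.T @ Phi / sigma2)
mN = SN @ (np.linalg.inv(S0) @ m0 + Phi.T @ y / sigma2)

print("posterior mean:", mN)
print("posterior covariance:\n", SN)
```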

However, it seems to me that if we try to derive the RHS by laws of conditional probability we get

$$\begin{aligned} p(\boldsymbol\theta\mid\mathcal{X,Y}) &= \frac{p(\mathcal{Y},\mathcal{X},\boldsymbol\theta)}{p(\mathcal{Y},\mathcal{X})} \\ &= \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta\mid\mathcal{X})\,p(\mathcal{X})}{p(\mathcal{Y}\mid\mathcal{X})\,p(\mathcal{X})} \\ &= \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta\mid\mathcal{X})}{p(\mathcal{Y}\mid\mathcal{X})}. \end{aligned} $$

Under my derivation, the book's formula only seems to hold if we can write $p(\boldsymbol\theta\mid\mathcal{X})=p(\boldsymbol\theta)$. Is my derivation incorrect, and if it is not, why are these two random variables independent?

  • Some posts looking at this from a frequentist point of view: https://stats.stackexchange.com/questions/215230/what-are-the-differences-between-stochastic-v-s-fixed-regressors-in-linear-regr/417324#417324, https://stats.stackexchange.com/questions/144826/what-is-the-difference-between-conditioning-on-regressors-vs-treating-them-as-f/192746#192746 – kjetil b halvorsen Nov 05 '20 at 22:32

1 Answer


In the usual setting, $\mathcal{X}$ is simply not considered a random variable, but is instead treated as deterministic! Therefore, $p(\boldsymbol\theta\mid\mathcal{X})=p(\boldsymbol\theta)$. This assumption is of course often violated in practice. There is plenty of discussion of this issue on the web; see, e.g., wiki1, wiki2 and this answer.
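Plugging this into your derivation makes the reduction explicit: after the $p(\mathcal{X})$ factors cancel, the fixed-design assumption turns your last line into exactly the book's formula,

$$p(\boldsymbol\theta\mid\mathcal{X,Y}) = \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta\mid\mathcal{X})}{p(\mathcal{Y}\mid\mathcal{X})} = \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta)}{p(\mathcal{Y}\mid\mathcal{X})}.$$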

In particular, I can recommend Buja's tutorial-style article on this subject (2014). To quote:

the predictors are treated as known constants even when they arise as random observations just like the response. Statisticians have long enjoyed the fruits that can be harvested from this model and they have taught it as fundamental at all levels of statistical education. Curiously little known to many statisticians is the fact that a different modeling framework is adopted and a different statistical education is taking place in the parallel universe of econometrics. For over three decades, starting with Halbert White’s (1980a, 1980b, 1982) seminal articles, econometricians have used multiple linear regression without making the many assumptions of classical linear models theory. While statisticians use assumption-laden exact finite sample inference, econometricians use assumption-lean asymptotic inference based on the so-called “sandwich estimator” of standard error.
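To make the quoted contrast concrete, here is a minimal numpy sketch (not from Buja's article; the data-generating choices are hypothetical) comparing classical OLS standard errors, which assume a single fixed noise variance, with White's heteroskedasticity-consistent "sandwich" (HC0) standard errors:

```python
import numpy as np

# Minimal sketch contrasting classical OLS standard errors with White's
# heteroskedasticity-consistent ("sandwich", HC0) standard errors.
# The data-generating process below is a hypothetical illustration.

rng = np.random.default_rng(1)
N = 500
X = np.column_stack([np.ones(N), rng.uniform(0, 2, N)])
# Heteroskedastic noise: the variance grows with the regressor.
eps = rng.normal(0.0, 0.5 * X[:, 1] + 0.1)
y = X @ np.array([1.0, 2.0]) + eps

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta

# Classical: Var(beta) = sigma^2 (X'X)^-1, with one shared noise variance.
sigma2_hat = resid @ resid / (N - X.shape[1])
se_classical = np.sqrt(np.diag(sigma2_hat * XtX_inv))

# Sandwich (HC0): (X'X)^-1 X' diag(e_i^2) X (X'X)^-1.
meat = X.T @ (X * resid[:, None] ** 2)
se_sandwich = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print("classical SEs:", se_classical)
print("sandwich  SEs:", se_sandwich)
```

With heteroskedastic noise like the above, the two estimates of the slope's standard error typically differ noticeably, which is exactly the gap the sandwich estimator is designed to close.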

jhin
  • Thanks so much for the super answer – these links are very useful :-). – orthonormal-stice Jul 02 '20 at 16:07
  • @orthonormal-stice Happy to help! I have to say that in statistics, much more than in other fields, I find it especially hard to figure out how things really work, because there are so many implicit assumptions nobody talks about, and because there are often whole subfields devoted to particular topics that you never find out about unless you know the secret keywords... – jhin Jul 02 '20 at 16:23