We are told (in Section 9.2.3 of Deisenroth et al., Mathematics for Machine Learning) that we can compute the posterior over a model's parameters $\boldsymbol\theta$ (here in the context of linear regression) given the data $\mathcal{X,Y}$ as
$$p(\boldsymbol\theta\mid\mathcal{X,Y})= \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)p(\boldsymbol\theta)}{p(\mathcal{Y}\mid\mathcal{X})}.$$
However, it seems to me that if we try to derive the right-hand side from the laws of conditional probability, we get
$$\begin{aligned} p(\boldsymbol\theta\mid\mathcal{X,Y}) &= \frac{p(\mathcal{Y},\mathcal{X},\boldsymbol\theta)}{p(\mathcal{Y},\mathcal{X})} \\ &= \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)p(\boldsymbol\theta\mid\mathcal{X})p(\mathcal{X})}{p(\mathcal{Y}\mid\mathcal{X})p(\mathcal{X})}. \end{aligned} $$
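Cancelling the common factor $p(\mathcal{X})$ from numerator and denominator, this simplifies to
$$p(\boldsymbol\theta\mid\mathcal{X,Y}) = \frac{p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta\mid\mathcal{X})}{p(\mathcal{Y}\mid\mathcal{X})}.$$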
Under my derivation, the book's formula only seems to hold if we can write $p(\boldsymbol\theta\mid\mathcal{X})=p(\boldsymbol\theta)$. Is my derivation incorrect, and if not, why are $\boldsymbol\theta$ and $\mathcal{X}$ independent?
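For what it's worth, here is a small numerical sanity check I sketched (purely illustrative; the toy binary distributions, the variable names, and the `random_dist` helper are my own choices, not from the book). It builds a joint over binary $\mathcal{X}$, $\boldsymbol\theta$, $\mathcal{Y}$, computes $p(\boldsymbol\theta\mid\mathcal{X,Y})$ by directly normalising the joint, and compares it with the book's formula $p(\mathcal{Y}\mid\mathcal{X},\boldsymbol\theta)\,p(\boldsymbol\theta)/p(\mathcal{Y}\mid\mathcal{X})$.

```python
import numpy as np

# Purely illustrative sanity check: X, theta, Y are binary, the joint is
# p(X) * p(theta | X) * p(Y | X, theta), and the posterior p(theta | X, Y)
# obtained by direct normalisation is compared with the book's formula
# p(Y | X, theta) * p(theta) / p(Y | X).

rng = np.random.default_rng(0)

def random_dist(shape):
    """Random probability table, normalised over the last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

def formulas_agree(p_theta_given_x):
    p_x = random_dist(2)                        # p(X)
    p_y_given_x_theta = random_dist((2, 2, 2))  # p(Y | X, theta), indexed [x, theta, y]
    p_theta = p_x @ p_theta_given_x             # marginal p(theta) = sum_x p(x) p(theta | x)

    # Full joint p(X, theta, Y), indexed [x, theta, y]
    joint = p_x[:, None, None] * p_theta_given_x[:, :, None] * p_y_given_x_theta

    x, y = 1, 0  # an arbitrary observed pair
    # Posterior by direct normalisation of the joint
    posterior = joint[x, :, y] / joint[x, :, y].sum()
    # Book's formula, with p(Y | X) = sum_theta p(Y | X, theta) p(theta | X)
    p_y_given_x = p_y_given_x_theta[x, :, y] @ p_theta_given_x[x]
    book = p_y_given_x_theta[x, :, y] * p_theta / p_y_given_x
    return np.allclose(posterior, book)

# theta independent of X: p(theta | X) does not depend on x, and the formulas agree
print(formulas_agree(np.tile(random_dist(2), (2, 1))))  # True
# theta dependent on X: they disagree in general
print(formulas_agree(random_dist((2, 2))))              # False (almost surely)
```

The two expressions agree when $\boldsymbol\theta$ is drawn independently of $\mathcal{X}$ and generally disagree otherwise, which is consistent with my reading that the formula implicitly assumes $p(\boldsymbol\theta\mid\mathcal{X})=p(\boldsymbol\theta)$.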