First, apply Bayes' rule conditionally: keep conditioning on $X$, $\boldsymbol\theta$, and $\mathcal H_i$ throughout, and only swap $\mathbf w$ and $\mathbf y$:
$$
p(\mathbf w \mid \mathbf y, X, \boldsymbol\theta, \mathcal H_i)
= \frac{p(\mathbf y \mid \mathbf w, X, \boldsymbol\theta, \mathcal H_i) \; p(\mathbf w \mid X, \boldsymbol\theta, \mathcal H_i)}{p(\mathbf y \mid X, \boldsymbol\theta, \mathcal H_i)}
.$$
Now, (5.3) will follow from this if we establish both of
\begin{gather*}
p(\mathbf y \mid \mathbf w, X, \boldsymbol\theta, \mathcal H_i)
= p(\mathbf y \mid \mathbf w, X, \mathcal H_i)
\tag{1}
\\
p(\mathbf w \mid X, \boldsymbol\theta, \mathcal H_i)
= p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i)
\tag{2}
.\end{gather*}
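Substituting (1) and (2) into the numerator of the Bayes' rule above gives
$$
p(\mathbf w \mid \mathbf y, X, \boldsymbol\theta, \mathcal H_i)
= \frac{p(\mathbf y \mid X, \mathbf w, \mathcal H_i) \; p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i)}{p(\mathbf y \mid X, \boldsymbol\theta, \mathcal H_i)}
,$$
which is exactly the form of (5.3).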
But remember what these things are:
- $X$ is the matrix of training inputs
- $\mathbf y$ is the vector of training outputs
- $\mathbf w$ is the vector of parameters, e.g. the weights of a neural network
- $\boldsymbol\theta$ is the vector of hyperparameters, e.g. the regularization weight
- $\mathcal H_i$ indexes which of several discrete model classes we're considering
So $\boldsymbol\theta$ only determines how we choose $\mathbf w$ given the other stuff. Once we're conditioning on $\mathbf w$ itself, also conditioning on $\boldsymbol\theta$ adds no information: $\mathbf y$ is independent of $\boldsymbol\theta$ given $\mathbf w$ (and $X$, $\mathcal H_i$). Thus (1) holds.
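As a concrete (if simplified) illustration, not the book's notation: take $\mathcal H_i$ to be Bayesian linear regression with a fixed noise variance $\sigma^2$, and let the single hyperparameter $\boldsymbol\theta$ be the prior precision, i.e. the regularization weight. Then
$$
p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i) = \mathcal N(\mathbf w \mid \mathbf 0, \boldsymbol\theta^{-1} I),
\qquad
p(\mathbf y \mid \mathbf w, X, \mathcal H_i) = \mathcal N(\mathbf y \mid X \mathbf w, \sigma^2 I)
.$$
The hyperparameter appears only in the prior over $\mathbf w$; once $\mathbf w$ is fixed, the distribution of $\mathbf y$ doesn't mention $\boldsymbol\theta$ at all, which is (1), and the prior over $\mathbf w$ is written down without reference to $X$, which is (2).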
(2) assumes that $\mathbf w$ doesn't depend on $X$ when we don't also have $\mathbf y$: that is, we know which points we're training on, but not their labels. This really is an assumption, and one which I think was more reasonable to make in 2006 when this book was written than it would be today. For instance, if we run (stochastic) gradient descent on a linear model, then the difference between the final weights $\hat{\mathbf w}$ and the initial weights $\mathbf w_0$ lies in the span of the data $X$; when $\mathbf w_0$ is small and the dimension of the data is much higher than the number of samples, this matters a lot. But machine learning people have only really appreciated how important this kind of "implicit bias" is in the past few years, and it's quite tough to model, so assuming that $\mathbf w$ is approximately independent of $X$ alone given $\boldsymbol\theta$ and $\mathcal H_i$ is probably close enough.
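Here's a quick numerical check of that span claim (a minimal sketch; the dimensions, step size, and iteration count are made up for illustration, not from the book):

```python
# Minimal sketch of the "weights stay in the span of the data" point above.
import numpy as np

rng = np.random.default_rng(0)

n, d = 20, 200                      # many more dimensions than samples
X = rng.standard_normal((n, d))     # rows are the training inputs
y = rng.standard_normal(n)

w = np.zeros(d)                     # start at w_0 = 0
lr = 0.01
for _ in range(5_000):
    grad = X.T @ (X @ w - y) / n    # gradient of mean squared error
    w -= lr * grad                  # each step moves w within the row space of X

# Project w onto the span of the data (the row space of X) and measure what's left.
P = X.T @ np.linalg.pinv(X.T)       # orthogonal projector onto span{x_1, ..., x_n}
print(np.linalg.norm(w - P @ w))    # ~1e-15: w never left the span
```

(With $d \gg n$ there are infinitely many weight vectors that fit the data exactly, and gradient descent started from zero converges to the minimum-norm one precisely because it never leaves this span.)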
Also: contrary to what you say in your question, it is not true that $p(\boldsymbol\theta \mid X, \mathbf y, \mathbf w, \mathcal H_i) = 1$ or that $p(X \mid \mathbf w, \mathcal H_i) = 1$.
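To see why for the first one: conditioning the same way as above (and assuming, in the same spirit as (2), that the prior on $\boldsymbol\theta$ doesn't depend on $X$),
$$
p(\boldsymbol\theta \mid X, \mathbf y, \mathbf w, \mathcal H_i)
\propto p(\mathbf w \mid \boldsymbol\theta, \mathcal H_i) \; p(\boldsymbol\theta \mid \mathcal H_i)
$$
as a function of $\boldsymbol\theta$, which is a genuine (and generally non-degenerate) posterior over the hyperparameters, not the constant $1$.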