In the paper "Practical Variational Inference for Neural Networks" by Alex Graves, equation 1 equates the negative log-probability of the dataset $\mathcal{D}$ given the weights $\mathbf{w}$ to the sum of the negative log conditional probabilities of the labels $\mathbf{y}$ given the inputs $\mathbf{x}$ and the weights.
\begin{align} -\ln \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=-\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \ln \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \label{1} \tag{1} \end{align}
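Writing the sum out over $N$ training pairs (the indexing is mine, not the paper's), this is just
\begin{align*} -\ln \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &= -\sum_{i=1}^{N} \ln \operatorname{Pr}(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}), \end{align*}
where $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$.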
For simplicity, let's drop the log (I know what the log does in this case!) and the minus sign, so equation \ref{1} can be rewritten as follows.
\begin{align} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}}\operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \label{2} \tag{2} \end{align}
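To be explicit about that step: negating both sides of equation \ref{1} and exponentiating turns the sum of logs into a product, since
\begin{align*} \exp\left(\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \ln \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\right) &= \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}). \end{align*}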
Let's assume the pairs $(\mathbf{X}=\mathbf{x}, \mathbf{Y}=\mathbf{y})$ are drawn i.i.d. from $p(\mathbf{X}, \mathbf{Y})$ (which is also assumed in the cited paper); then
\begin{align} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}}\operatorname{Pr}(\mathbf{x}, \mathbf{y} \mid \mathbf{w}) \label{3} \tag{3} \end{align}
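Spelling out why I believe equation \ref{3} holds: treating $\operatorname{Pr}(\mathcal{D} \mid \mathbf{w})$ as the joint probability of all $N$ pairs $(\mathbf{x}_i, \mathbf{y}_i)$ in the dataset, their independence gives
\begin{align*} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &= \operatorname{Pr}\big((\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N) \mid \mathbf{w}\big) = \prod_{i=1}^{N} \operatorname{Pr}(\mathbf{x}_i, \mathbf{y}_i \mid \mathbf{w}), \end{align*}
which is equation \ref{3} written with explicit indices.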
How does equation \ref{3} become equation \ref{2}? What's the proof? If the two expressions are equal, there must be one. In general, $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})$, so the only way I can see equation \ref{3} equalling equation \ref{2} is if $\operatorname{Pr}(\mathbf{x} \mid \mathbf{w}) = 1$ for every input $\mathbf{x}$ in $\mathcal{D}$. If that's the case, why would this assumption be true?
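To make the gap explicit: applying the chain rule to each factor in equation \ref{3} gives
\begin{align*} \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x}, \mathbf{y} \mid \mathbf{w}) &= \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \operatorname{Pr}(\mathbf{x} \mid \mathbf{w}) \\ &= \left(\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\right) \left(\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x} \mid \mathbf{w})\right), \end{align*}
so the right-hand sides of equations \ref{2} and \ref{3} differ by the factor $\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x} \mid \mathbf{w})$, which is the term I cannot account for.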
I've seen equation \ref{1} in many other articles and papers, so either they are all wrong or I am not seeing the proof here.