In the paper "Practical Variational Inference for Neural Networks" by Alex Graves, equation 1 equates the negative log-probability of the dataset $\mathcal{D}$ given the weights $\mathbf{w}$ to the sum of the negative log conditional probabilities of the labels $\mathbf{y}$ given the inputs $\mathbf{x}$ and the weights.
\begin{align} -\ln \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=-\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \ln \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \label{1} \tag{1} \end{align}
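Writing the sum out over $N$ training pairs (the indexing is mine, not the paper's), this is just
\begin{align*} -\ln \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &= -\sum_{i=1}^{N} \ln \operatorname{Pr}(\mathbf{y}_i \mid \mathbf{x}_i, \mathbf{w}), \end{align*}
where $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N)\}$.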
For simplicity, let's drop the log (I know what the log does in this case!) and the minus sign, so equation \ref{1} can be rewritten as follows.
\begin{align} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}}\operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \label{2} \tag{2} \end{align}
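To be explicit about that step: negating both sides of equation \ref{1} and exponentiating turns the sum of logs into a product, since
\begin{align*} \exp\left(\sum_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \ln \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\right) &= \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}). \end{align*}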
Let's assume the pairs $(\mathbf{X}=\mathbf{x}, \mathbf{Y}=\mathbf{y})$ are drawn i.i.d. from $p(\mathbf{X}, \mathbf{Y})$ (which is also assumed in the cited paper); then
\begin{align} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &=\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}}\operatorname{Pr}(\mathbf{x}, \mathbf{y} \mid \mathbf{w}) \label{3} \tag{3} \end{align}
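Spelling out why I believe equation \ref{3} holds: treating $\operatorname{Pr}(\mathcal{D} \mid \mathbf{w})$ as the joint probability of all $N$ pairs $(\mathbf{x}_i, \mathbf{y}_i)$ in the dataset, their independence gives
\begin{align*} \operatorname{Pr}(\mathcal{D} \mid \mathbf{w}) &= \operatorname{Pr}\big((\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_N, \mathbf{y}_N) \mid \mathbf{w}\big) = \prod_{i=1}^{N} \operatorname{Pr}(\mathbf{x}_i, \mathbf{y}_i \mid \mathbf{w}), \end{align*}
which is equation \ref{3} written with explicit indices.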
How does equation \ref{3} become equation \ref{2}? What's the proof? If the two expressions are equal, there must be one. In general, $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x} \mid \mathbf{y})\, p(\mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})$, so the only way I can see equation \ref{3} equalling equation \ref{2} is if $\operatorname{Pr}(\mathbf{x} \mid \mathbf{w}) = 1$ for every input $\mathbf{x}$ in $\mathcal{D}$. If that's the case, why would this assumption be true?
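To make the gap explicit: applying the chain rule to each factor in equation \ref{3} gives
\begin{align*} \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x}, \mathbf{y} \mid \mathbf{w}) &= \prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w}) \operatorname{Pr}(\mathbf{x} \mid \mathbf{w}) \\ &= \left(\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{y} \mid \mathbf{x}, \mathbf{w})\right) \left(\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x} \mid \mathbf{w})\right), \end{align*}
so the right-hand sides of equations \ref{2} and \ref{3} differ by the factor $\prod_{(\mathbf{x}, \mathbf{y}) \in \mathcal{D}} \operatorname{Pr}(\mathbf{x} \mid \mathbf{w})$, which is the term I cannot account for.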
I've seen equation \ref{1} in many other articles and papers, so either they are all wrong or I am not seeing the proof here.