
I am trying to understand the EM algorithm. I see that in the expectation step we need to compute $E_{Z|X}[\log p(X,Z)] = \sum_{Z} p(Z|X) \log p(X,Z)$, where $Z$ is the hidden variable and $X$ is the observed variable. I understand that we need to compute the expected value of $\log p(X,Z)$ because we cannot compute its exact value: it depends on $Z$, whose value we do not know. But I have two separate confusions here:

  1. Why don't we do, say, block coordinate descent instead? That is, why don't we compute $Z$ and the parameters of the model (say $\theta$) iteratively, computing $Z$ in one step using the current value of $\theta$, and $\theta$ in the other step using the current value of $Z$?

  2. Let's say I accept that we cannot do block coordinate descent and that we need to compute the expectation of $\log p(X,Z)$. Then why do we compute $E_{Z|X}[\log p(X,Z)]$ rather than $E_Z[\log p(X,Z)] = \sum_{Z} p(Z) \log p(X,Z)$? What does it mean (intuitively, or in words) to compute an expected value with respect to a conditional probability?

user5054
  • I think you may find your answer in one of my unanswered questions. – Alice May 03 '17 at 03:43
  • Here is my question: https://stats.stackexchange.com/questions/277073/question-about-e-step-in-em-algorithm-a-challenge-part-any-help-please Please read through the posted photo. – Alice May 03 '17 at 03:45
  • I think the first two paragraphs relate to your second question. I was really confused about it, but maybe it will be clear to you. – Alice May 03 '17 at 03:46

1 Answer


For your first question: EM is in fact very close to block coordinate ascent. You alternate between updating $Z$ (more precisely, its posterior distribution; see below) and updating $\theta$. (Note, however, that $Z$ is commonly a set of categorical variables, i.e. labels, in EM applications, so it is not exactly a coordinate "descent" per se.)

For the second question, the most common example of a conditional expectation is probably the standard regression problem, i.e. $$y=f(x)+\epsilon\,,\,\mathbb{E}\big[\epsilon\big]=0\implies\mathbb{E}\big[y\mid{x}\big]=f(x)$$
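To see this numerically, here is a minimal simulation sketch (the choice $f(x)=\sin(x)$, the noise level, and all names are illustrative assumptions, not part of the original problem): at a fixed $x$, the average of many noisy draws of $y$ approximates $f(x)$, which is exactly what the conditional expectation $\mathbb{E}\big[y\mid{x}\big]$ means.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" regression function; sin is an arbitrary illustrative choice.
    return np.sin(x)

x0 = 1.0  # condition on a particular observed x
y_draws = f(x0) + rng.normal(0.0, 0.5, size=100_000)  # y = f(x) + noise, E[noise] = 0

# The sample mean of y at fixed x approximates E[y | x] = f(x).
print(y_draws.mean())  # close to f(1.0) = sin(1.0) ~ 0.841
```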


With regards to the first question, I would recommend checking out the Wikipedia description. You could update $Z$ itself based on the current $\theta$ (this is called hard EM), but in general you update the entire posterior $p\big[Z\mid{X,\theta}\big]$ instead (note that the observed data $X$ is fixed throughout; see also here). For a concrete example of the difference, see here.
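For intuition, here is a minimal sketch of the two variants for a two-component 1-D Gaussian mixture (the model, parameter values, and names are illustrative assumptions, not the linked examples): hard EM commits each point to its most probable label before updating $\theta$, whereas the standard E-step keeps the full posterior ("responsibilities") and the M-step weights every point by it.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 1, 100)])  # observed X

# Current guess for theta = (mixture weights, means, standard deviations).
w, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

# E-step: posterior p(Z = k | x, theta); one row per point, one column per component.
lik = w * norm.pdf(x[:, None], mu, sd)
resp = lik / lik.sum(axis=1, keepdims=True)

# Hard EM: commit each point to its most probable component, then average.
z_hard = resp.argmax(axis=1)
mu_hard = np.array([x[z_hard == k].mean() for k in range(2)])

# Standard (soft) EM: weight every point by its responsibility.
mu_soft = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu_hard, mu_soft)
```

With well-separated components the two updates nearly coincide; the difference shows up for points near the overlap, where hard assignment throws away the posterior uncertainty.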

With regards to the requested clarification on the second question: imagine you want to do standard MLE for $\theta$ based on the data $\hat{X}$. But the data $\hat{X}$ is not fully observed; that is, $\hat{X}=[X,Z]$, but $Z$ is unknown. As noted in Wikipedia, a natural approach is to maximize the marginal likelihood $$p\big[X\mid{\theta}\big]=\sum_Z p\big[X,Z\mid{\theta}\big]$$ over $\theta$, where the sum runs over the possible values of $Z$. Since $X$ is known and fixed, EM proceeds by taking the expected value of the complete-data log-likelihood over possible $Z$ values, conditioned on $X$ and $\theta$.
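To connect this to the formula in the question (a standard restatement, with $\theta^{(t)}$ denoting the current parameter estimate): the E-step forms $$Q\big(\theta\mid\theta^{(t)}\big)=E_{Z\mid X,\theta^{(t)}}\big[\log p\big(X,Z\mid\theta\big)\big]=\sum_Z p\big(Z\mid X,\theta^{(t)}\big)\log p\big(X,Z\mid\theta\big)\,,$$ and the M-step maximizes $Q\big(\theta\mid\theta^{(t)}\big)$ over $\theta$ to obtain $\theta^{(t+1)}$.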

Note that $p\big[Z\big]=\int{p}\big[Z\mid{X}\big]p\big[X\big]dX$. But since $X$ is known and fixed, we don't care what the likelihood of $Z$ is given some other hypothetical value of $X$. (For simplicity I omit the $p\big[\ldots\mid{\theta}\big]$ in the above.)
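Here is a tiny numeric sketch of the difference between the two expectations for a made-up discrete model (all probabilities are arbitrary illustrative assumptions): weighting by the prior $p(Z)$ mixes in $Z$ values that the observed data has already ruled mostly out, while weighting by $p(Z\mid X)$ uses exactly the evidence we have.

```python
import numpy as np

# A made-up joint distribution p(X, Z) over X in {0, 1} (rows) and Z in {0, 1} (columns).
p_xz = np.array([[0.40, 0.10],
                 [0.05, 0.45]])

x_obs = 1                                      # suppose we observed X = 1
p_z = p_xz.sum(axis=0)                         # prior     p(Z)     = [0.45, 0.55]
p_z_given_x = p_xz[x_obs] / p_xz[x_obs].sum()  # posterior p(Z|X=1) = [0.10, 0.90]

log_p = np.log(p_xz[x_obs])                    # log p(X=1, Z) for each Z

print(np.dot(p_z, log_p))          # E_Z[log p(X,Z)]:     weights by the prior
print(np.dot(p_z_given_x, log_p))  # E_{Z|X}[log p(X,Z)]: weights by the posterior
```

The two numbers differ because the posterior concentrates on the $Z$ value that is actually consistent with the observed $X$, which is the whole point of conditioning on $X$.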

GeoMatt22
  • Thanks, but I am confused. Follow-up questions: 1) We are not updating the value of $Z$ at all, right? We are just updating the expected value of the log-likelihood, $E[\log p(Z,X)]$. Do we need to recompute $p(Z|X)$ in the next expectation step in order to recompute $E[\log p(Z,X)]$? Even if we do, we are not updating the actual value of $Z$ in the expectation step. 2) I can't see the relation of this example to $E_{Z|X}[\log p(Z,X)]$. The conditioning variable is not a subscript in your example, right? And the main point of my question is why we don't compute $E_Z[\log p(Z,X)]$ instead of $E_{Z|X}[\log p(Z,X)]$. – user5054 May 03 '17 at 02:14
  • You want to get the value of $Z$ using your data. – Alice May 03 '17 at 03:31