
Assume we have a probability distribution $P(y,z|\theta)$, where the total set of variables $x$ is divided into observable variables $y$ and hidden variables $z$, and that we have data on the observable variables $y$. We wish to determine the $\theta$ that maximizes the likelihood $P(y|\theta)$.

This tutorial (after my splitting their total set of variables $x$ into $y,z$ for clarity) presents the EM algorithm as choosing an arbitrary initial value $\theta^0$ for a probabilistic model's parameters and then repeatedly solving, for $m=1,2,\dots$ until convergence:

\begin{eqnarray*} \theta^m &=& \text{argmax}_\theta \sum_z \log p(y,z|\theta)\,p(z|y,\theta^{m-1}) \end{eqnarray*}

That requires maximizing over $\theta$ an expression that itself involves a summation over $z$.

My question is: why not simply more directly solve the problem "what is the most likely value $\theta^*$ for $\theta$ given $y$," that is:

\begin{eqnarray*} \theta^* &=& \text{argmax}_\theta P(y|\theta) \\ &=& \text{argmax}_\theta \sum_z P(y,z|\theta) \end{eqnarray*}

which also requires maximizing over $\theta$ a summation over $z$, but does so in a single shot and is conceptually simpler?

In other words, why do we need the iterative algorithm, and why do we need the log?
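For concreteness, here is a minimal sketch (my own toy example, not from the tutorial) of the two routes I am comparing, using a two-component Gaussian mixture with known weights and variances and unknown means as a stand-in model:

```python
# Toy model: y_i ~ 0.3 * N(mu1, 1) + 0.7 * N(mu2, 0.5^2); z_i (which
# component generated y_i) is hidden; theta = (mu1, mu2).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
z = rng.random(200) < 0.7                       # hidden labels, used only to simulate y
y = np.where(z, rng.normal(2.0, 0.5, 200), rng.normal(-1.0, 1.0, 200))

def neg_log_lik(theta):
    """-log P(y | theta) = -sum_i log sum_z P(y_i, z | theta)."""
    mu1, mu2 = theta
    return -np.sum(np.log(0.3 * norm.pdf(y, mu1, 1.0) + 0.7 * norm.pdf(y, mu2, 0.5)))

# "Direct" route: hand -log P(y | theta) to a generic optimizer in one shot.
direct = minimize(neg_log_lik, x0=[0.0, 1.0])

# EM route: alternate posterior weights and weighted-mean updates.
mu1, mu2 = 0.0, 1.0                             # arbitrary theta^0
for _ in range(100):
    p1 = 0.3 * norm.pdf(y, mu1, 1.0)
    p2 = 0.7 * norm.pdf(y, mu2, 0.5)
    w = p1 / (p1 + p2)                          # p(z = 1 | y_i, theta^{m-1})
    mu1 = np.sum(w * y) / np.sum(w)             # argmax of the expected
    mu2 = np.sum((1 - w) * y) / np.sum(1 - w)   # complete-data log-likelihood

print(direct.x, (round(mu1, 2), round(mu2, 2)))
```

(The closed-form weighted-mean update in the EM loop is specific to this toy model; the point is only to make the two routes concrete.)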

Update: it has been suggested that this is a duplicate of Why should one use EM vs. say, Gradient Descent with MLE?. It is indeed highly related, but the answer there says that EM is better without explaining why, and it does not address the need to use the logarithm.

user118967
  • Writing $$\theta^* = \arg\max_\theta P(y|\theta)$$ does not mean this value can be easily found for functions $P(y|\theta)$ that are not regular enough (e.g., non-concave or multimodal). – Xi'an Apr 26 '18 at 07:57
  • Fair enough, but then why does EM do it better? – user118967 Apr 26 '18 at 16:42
  • Better than the alternative we are discussing, namely directly optimizing $P(y|\theta)$, which you said does worse. – user118967 Apr 26 '18 at 17:29
  • In general, you **cannot** generically find $\hat \theta = \arg \max_\theta P(y|\theta)$ directly. But the EM algorithm is one recipe (among many) for actually finding $\hat \theta$ via an iterative algorithm. – Cliff AB Apr 26 '18 at 18:11
  • You may have seen a particular illustrative example in which you *could* find $\hat \theta$ directly, and then it was demonstrated that the EM algorithm arrives at the same solution. In that case, the use of the EM algorithm was *only* to help illustrate how the algorithm works; in practice, the closed form solution would be used. But if no closed form solution exists, then you may still be able to use the EM algorithm. – Cliff AB Apr 26 '18 at 18:15

2 Answers


Stating that the MLE solves the optimisation program $$\hat{\theta}=\arg\max_\theta f(y|\theta)$$ does not explain how one proceeds in practice to find this maximiser. It is simply stating an inference principle.

[Figure: log-likelihood surface of a two-component Gaussian mixture over $(\mu_1,\mu_2)$]

Above is the log-likelihood surface of the likelihood $$L(\mu_1,\mu_2)=\prod_{i=1}^n \left\{\frac{3}{10} \varphi(x_i;\mu_1,1) + \frac{7}{10} \varphi(x_i;\mu_2,1/2) \right\}$$ for a sample of size $n=92$. This is a mixture of two Gaussian distributions with unknown means. The surface is smooth but, from an optimisation perspective, it is not regular enough, exhibiting saddle points, plateaus, and multiple modes. This means that an off-the-shelf optimisation method like the gradient method or the Newton-Raphson algorithm is unable to find the global maximum without a sufficiently fine partition of the parameter space. Optimising this non-convex function is a difficult problem, which is not easily tackled as a purely mathematical maximisation problem.
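As a quick numerical illustration of this sensitivity to the starting point (my own simulated sample, not the $n=92$ sample behind the figure, and with the $1/2$ read as a standard deviation):

```python
# Off-the-shelf optimisation of the mixture log-likelihood from several
# starting points: each run typically converges to a nearby local maximum,
# which need not be the global one.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 92
lab = rng.random(n) < 0.7
x = np.where(lab, rng.normal(2.5, 0.5, n), rng.normal(0.0, 1.0, n))

def neg_log_lik(mu):
    mix = 0.3 * norm.pdf(x, mu[0], 1.0) + 0.7 * norm.pdf(x, mu[1], 0.5)
    return -np.sum(np.log(mix))

for start in ([0.0, 2.5], [2.5, 0.0], [5.0, -5.0]):
    res = minimize(neg_log_lik, x0=start, method="BFGS")
    print(start, "->", np.round(res.x, 2), "log-lik:", round(-res.fun, 2))
```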

As for the question about the log, i.e., about repeatedly maximising $$\mathbb{E}_{\theta^m}[\log P(y,Z|\theta)|Y=y]$$ rather than maximising once $$\int P(y,z|\theta)\,\text{d}z\,,$$ the use of the logarithm is what guarantees that the target likelihood $P(y|\theta)$ increases at each iteration of EM. Indeed, the M step ensures that $$\mathbb{E}_{\theta^m}[\log P(y,Z|\theta^{m+1})|Y=y]\ge\mathbb{E}_{\theta^m}[\log P(y,Z|\theta^m)|Y=y]\,,$$ the decomposition $P(y,Z|\theta)=P(y|\theta)\,P(Z|y,\theta)$ gives $$\mathbb{E}_{\theta^m}[\log P(y,Z|\theta)|Y=y]=\log P(y|\theta)+\mathbb{E}_{\theta^m}[\log P(Z|y,\theta)|Y=y]\,,$$ and the non-negativity of the Kullback-Leibler divergence implies $$\mathbb{E}_{\theta^m}[\log P(Z|y,\theta^m)|Y=y]\ge \mathbb{E}_{\theta^m}[\log P(Z|y,\theta^{m+1})|Y=y]\,.$$ Combining the three displays yields $\log P(y|\theta^{m+1})\ge\log P(y|\theta^m)$.
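A minimal numerical sketch of this monotonicity for the two-mean mixture above (again with my own simulated sample, the $1/2$ read as a standard deviation, and the weights and scales taken as known):

```python
# EM recursion for (mu1, mu2): the observed-data log-likelihood
# log P(y | theta) never decreases from one iteration to the next.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 92
lab = rng.random(n) < 0.7
x = np.where(lab, rng.normal(2.5, 0.5, n), rng.normal(0.0, 1.0, n))

def log_lik(mu1, mu2):
    return np.sum(np.log(0.3 * norm.pdf(x, mu1, 1.0) + 0.7 * norm.pdf(x, mu2, 0.5)))

mu1, mu2 = -4.0, 4.0                            # arbitrary starting value theta^0
prev = log_lik(mu1, mu2)
for m in range(50):
    # E step: posterior probability that x_i comes from the first component
    p1 = 0.3 * norm.pdf(x, mu1, 1.0)
    p2 = 0.7 * norm.pdf(x, mu2, 0.5)
    w = p1 / (p1 + p2)
    # M step: maximise E_{theta^m}[log P(y, Z | theta) | Y = y] over the means
    mu1 = np.sum(w * x) / np.sum(w)
    mu2 = np.sum((1 - w) * x) / np.sum(1 - w)
    cur = log_lik(mu1, mu2)
    assert cur >= prev - 1e-9                   # monotone increase, up to rounding
    prev = cur

print(round(mu1, 2), round(mu2, 2), round(prev, 2))
```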

Xi'an
  • Sorry for this late reply. I am not very knowledgeable about optimization, but I know that people use gradient methods to train neural networks, which are very non-convex. You said that gradient methods can't deal with non-convex optimization. Am I right? – floyd Sep 07 '19 at 10:04
  • I second the above comment by @floyd. Deep neural networks are also non-convex and complex, yet we can use gradient-based methods to optimize them, so why would we still need EM nowadays? – avocado Mar 28 '21 at 11:13
  • OP here. @Xi'an, sorry for the extremely late reply. In your comment on the question, you say $P(y|\theta)$ may not be regular enough. However, we can still find a local maximum. One might argue that finding a local maximum is not enough for a non-convex function. But as floyd and avocado mention above, this does not seem to be a problem for deep neural networks. – user118967 Jan 22 '22 at 15:55
  • Besides, EM is also finding a local maximum, only it is doing so for a different function (referred to as $Q$ in EM texts). So the question remains as to why it is better to find a $Q$ local maximum than finding a $P(y|\theta)$ local maximum. Is $Q$ more regular than $P(y|\theta)$? If so, I assume that is a non-trivial fact to demonstrate. – user118967 Jan 22 '22 at 15:56
  • I have neither experience nor knowledge as to whether and why deep neural networks attain a global maximum. To quote [Bengio et al.](https://shorturl.at/bejFX) (2016, Chap. 8, p. 269): "in the context of deep learning, we rarely use empirical risk minimization (...) training algorithms do not usually halt at a local minimum (...) a machine learning algorithm halts when a convergence criterion based on early stopping is satisfied" – Xi'an Jan 23 '22 at 10:43

The log is present because of the notion that it's often easier to maximize the log-likelihood $\log p(y\mid\theta)$ than the raw likelihood $p(y\mid\theta)$. In the case where we have a hidden variable $z$, the summation over $z$ prevents the log from acting on $p(y,z\mid\theta)$: $$\log p(y\mid\theta)=\log\sum_zp(y,z\mid\theta),\tag1$$ so applying the log is no help in maximizing the expression (1).

The insight in the EM algorithm is to get past this roadblock by approximating $\log p(y\mid\theta)$ by its expectation over the hidden variable: $$ \log p(y\mid\theta)\approx E_z \log p(y,z\mid\theta)\tag2 $$ where $E_z$ denotes expectation over $p(z\mid y,\theta)$, the posterior distribution of the hidden variable $z$ given $y$ and $\theta$. This leads us to consider $$ E_z \log p(y,z\mid\theta) = \sum_z \log p(y,z\mid\theta)\, p(z\mid y,\theta)\tag3 $$ where now the log acts directly on the likelihood of the total data.

The EM algorithm then proceeds iteratively, by alternating between:

[E step] evaluating $p(z\mid y,\theta)$ using an initial guess for $\theta$ and plugging it into (3); and

[M step] maximizing (3) over $\theta$ to obtain a revised guess for $\theta$.

The maximization that takes place in the M step is made more tractable by the fact that the log now acts directly on $p(y,z\mid\theta)$.
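To make the alternation concrete, here is a minimal numerical sketch (my own toy example): each observation $y_i$ is the number of heads in 10 tosses of one of two coins chosen with equal probability, $z_i$ indicates which coin was used, and $\theta=(p_A,p_B)$ collects the unknown coin biases. The M step is done with a generic optimizer purely to keep the correspondence with (3) explicit; for this toy model a closed-form M step exists.

```python
# Sketch of EM for a two-coin mixture: y_i ~ Binomial(10, p_A) or
# Binomial(10, p_B), each with probability 1/2; z_i (which coin) is hidden.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import binom

rng = np.random.default_rng(2)
z = rng.random(150) < 0.5                       # hidden labels, used only to simulate y
y = np.where(z, rng.binomial(10, 0.8, 150), rng.binomial(10, 0.35, 150))

def e_step(pA, pB):
    """E step: posterior p(z = A | y_i, theta) for each observation."""
    a = 0.5 * binom.pmf(y, 10, pA)
    b = 0.5 * binom.pmf(y, 10, pB)
    return a / (a + b)

def neg_Q(theta, w):
    """Negative of (3): -sum_i sum_z log p(y_i, z | theta) p(z | y_i, theta_old)."""
    pA, pB = theta
    return -np.sum(w * np.log(0.5 * binom.pmf(y, 10, pA))
                   + (1 - w) * np.log(0.5 * binom.pmf(y, 10, pB)))

pA, pB = 0.6, 0.5                               # initial guess for theta
for _ in range(30):
    w = e_step(pA, pB)                          # E step: plug current theta into (3)
    res = minimize(neg_Q, x0=[pA, pB], args=(w,),
                   bounds=[(1e-6, 1 - 1e-6)] * 2)   # M step: maximize (3) over theta
    pA, pB = res.x

print(round(pA, 3), round(pB, 3))               # revised guesses for the coin biases
```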

The theory behind EM assures us that this algorithm generates a sequence $\{\theta^m\}$ for which the log likelihood (1) increases with every iteration.

grand_chat
  • That's a very informative answer, thank you. You address the log question very well (although I would have liked to see the derivation of $(2)$), but what about the need for iterations? Why can't we maximize $\theta$ in one shot? – user118967 Jan 22 '22 at 18:21