What I understand Andrew Ng is saying is the following:
We want to estimate the parameters $\phi, \mu, \Sigma$ by maximising $l(\phi, \mu, \Sigma)$, i.e. the log-likelihood of the observed data $x^{(i)}$, with respect to those parameters. However, there are latent variables $z^{(i)}$, which complicate matters.
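For reference, in the mixture-of-Gaussians setting of those notes the generative model is (writing $\Sigma_j$ for the covariance of component $j$):
$$z^{(i)} \sim \text{Multinomial}(\phi), \qquad x^{(i)} \mid z^{(i)} = j \sim \mathcal{N}(\mu_j, \Sigma_j),$$
so $\phi$ governs the latent component assignments and $(\mu, \Sigma)$ govern the observations given those assignments.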
Ideally, we would like to do the following:
$$\begin{align}\phi^*, \mu^*, \Sigma^* &= \text{argmax}_{\phi, \mu, \Sigma} \sum^m_{i=1} \log p(x^{(i)}; \phi, \mu, \Sigma) \\
&= \text{argmax}_{\phi, \mu, \Sigma} \sum^m_{i=1} \log \sum^k_{z^{(i)} = 1} p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma) \\
\end{align}$$
That is, we maximise the log-likelihood of the observed data after marginalising the latent variables $z^{(i)}$ out of the joint distribution over their $k$ possible states.
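To make the objective concrete, here is a minimal numpy/scipy sketch of evaluating that marginal log-likelihood for a Gaussian mixture (the function name, array shapes, and per-component covariances $\Sigma_j$ are my own assumptions, not anything from the notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, phi, mu, Sigma):
    """Marginal (incomplete-data) log-likelihood of a Gaussian mixture.

    X: (m, n) data, phi: (k,) mixing proportions,
    mu: (k, n) means, Sigma: (k, n, n) covariances.
    """
    m = X.shape[0]
    k = phi.shape[0]
    # log p(x^(i), z^(i) = j) = log phi_j + log N(x^(i); mu_j, Sigma_j)
    log_joint = np.empty((m, k))
    for j in range(k):
        log_joint[:, j] = np.log(phi[j]) + multivariate_normal.logpdf(X, mean=mu[j], cov=Sigma[j])
    # marginalise out z^(i) with a stable log-sum-exp over the k components
    max_j = log_joint.max(axis=1, keepdims=True)
    log_marginal = max_j[:, 0] + np.log(np.exp(log_joint - max_j).sum(axis=1))
    return log_marginal.sum()
```

Note that the sum over components sits inside the logarithm: this is exactly the $\log(\sum)$ structure discussed next.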
What he means by:

> However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we’ll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.)

is that the expression is hard to maximise because of the $\log(\sum)$ term, i.e. a summation sitting inside a logarithm. This is a difficulty you will frequently encounter with latent variable models in ML, and it is what necessitates EM.
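To see concretely where the closed form breaks down, take the derivative with respect to one of the means, say $\mu_j$ (a sketch assuming, as in the notes, Gaussian components $p(x^{(i)} \mid z^{(i)} = j) = \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)$):
$$\frac{\partial}{\partial \mu_j} \sum^m_{i=1} \log \sum^k_{z^{(i)} = 1} p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma) = \sum^m_{i=1} w_j^{(i)} \, \Sigma_j^{-1}\left(x^{(i)} - \mu_j\right), \qquad w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$
Setting this to zero would give $\mu_j = \sum_i w_j^{(i)} x^{(i)} \big/ \sum_i w_j^{(i)}$, but the weights $w_j^{(i)}$ themselves depend on $\phi, \mu, \Sigma$, so this is not a closed-form solution; EM breaks the circularity by holding the $w_j^{(i)}$ fixed in the E-step and then maximising in the M-step.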
Now imagine that the $z^{(i)}$ were also observed. The log-likelihood of the observed data, now both $x^{(i)}$ and $z^{(i)}$, would be the log of the joint distribution evaluated at $x^{(i)}$ and $z^{(i)}$, with the latter now known.
So in this case, as the log-likelihood is just a function of the parameters, our maximisation problem would be:
$$\begin{align}\phi^*, \mu^*, \Sigma^* &= \text{argmax}_{\phi, \mu, \Sigma} \sum^m_{i=1} \log p(x^{(i)}, z^{(i)}; \phi, \mu, \Sigma) \\
&= \text{argmax}_{\phi, \mu, \Sigma} \sum^m_{i=1} \log \left( p(x^{(i)} | z^{(i)}; \mu, \Sigma) \, p(z^{(i)}; \phi)\right) \\
&= \text{argmax}_{\phi, \mu, \Sigma} \sum^m_{i=1} \left[ \log p(x^{(i)} | z^{(i)}; \mu, \Sigma) + \log p(z^{(i)}; \phi)\right]
\end{align}$$
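Because the objective now decouples into a term involving only $(\mu, \Sigma)$ and a term involving only $\phi$, each parameter can be solved for directly. As a hedged sketch of those closed-form estimates (zero-indexed labels and variable names are my own assumptions):

```python
import numpy as np

def complete_data_mle(X, z, k):
    """Closed-form MLE of (phi, mu, Sigma) when the component labels z are observed.

    X: (m, n) data, z: (m,) labels in {0, ..., k-1}.
    """
    m, n = X.shape
    phi = np.empty(k)
    mu = np.empty((k, n))
    Sigma = np.empty((k, n, n))
    for j in range(k):
        mask = (z == j)
        phi[j] = mask.mean()                          # fraction of points assigned to component j
        mu[j] = X[mask].mean(axis=0)                  # sample mean of those points
        centred = X[mask] - mu[j]
        Sigma[j] = centred.T @ centred / mask.sum()   # MLE covariance (no Bessel correction)
    return phi, mu, Sigma
```

These are just the usual count, sample-mean, and sample-covariance estimates, which is why the problem is easy when the $z^{(i)}$ are observed.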
However, we cannot do this because the $z^{(i)}$ are not actually known.
To close, there are two takeaways in my opinion:
1. A $\log(\sum)$ is difficult to maximise when the parameters you are maximising with respect to appear inside the summation.
2. The presence of latent variables complicates maximum likelihood estimation, which is what necessitates techniques such as EM.
Just read this back to myself and realised I may have missed the point of your question.
The reason you care about $p(z^{(i)}; \phi)$ is that, in Ng's language, you have specified a generative model of $p(x, z)$ rather than a discriminative model, albeit applied to unsupervised density estimation rather than supervised learning. If that is somewhat unclear, just prompt below and I will edit.