Let's say I have some [multivariate] data and want to fit a GMM to it. So I have $P(x)=\sum_{i=1}^{n}\alpha_i\,N(x;\theta_i)$, where $x$ is an observation from the data, $\theta_i$ contains the mean and covariance matrix parameters for the $i$th Gaussian, and $\sum_{i=1}^{n}\alpha_i=1$, i.e. a constraint that ensures the mixture of Gaussians is a valid probability distribution.
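To make the model concrete, here is a minimal sketch of how I picture evaluating that density (using scipy's `multivariate_normal`; the two-component parameters below are just made-up placeholders):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, alphas, means, covs):
    # P(x) = sum_i alpha_i * N(x; mu_i, Sigma_i)
    return sum(a * multivariate_normal.pdf(x, mean=m, cov=c)
               for a, m, c in zip(alphas, means, covs))

# toy 2-D mixture with two components (placeholder parameters)
alphas = [0.3, 0.7]                                  # must sum to 1
means  = [np.zeros(2), np.array([3.0, 3.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), alphas, means, covs))
```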
As we know, this would be easy to set up a likelihood for and optimize via maximum likelihood. The optimized result would be a local optimum and a valid probability distribution (I could ensure a global optimum by doing something like self-contrastive estimation as described by Ian Goodfellow), but admittedly I'm now a bit stuck on interpretation.
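For what it's worth, this is the kind of direct maximum-likelihood setup I have in mind, a sketch only: I assume diagonal covariances for simplicity, push the weights through a softmax so $\sum_i\alpha_i=1$ holds automatically, and hand the negative log-likelihood to a generic quasi-Newton optimizer rather than EM.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import multivariate_normal

def unpack(params, n_comp, dim):
    # unconstrained parameter vector -> (alphas, means, variances)
    logits = params[:n_comp]
    means = params[n_comp:n_comp + n_comp * dim].reshape(n_comp, dim)
    log_var = params[n_comp + n_comp * dim:].reshape(n_comp, dim)
    alphas = np.exp(logits - logits.max())
    alphas /= alphas.sum()                 # softmax: mixing weights sum to 1
    return alphas, means, np.exp(log_var)  # exp keeps variances positive

def neg_log_lik(params, X, n_comp):
    alphas, means, variances = unpack(params, n_comp, X.shape[1])
    dens = np.zeros(X.shape[0])
    for a, m, v in zip(alphas, means, variances):
        dens += a * multivariate_normal.pdf(X, mean=m, cov=np.diag(v))
    return -np.sum(np.log(dens + 1e-300))  # guard against log(0)

# toy data: two well-separated blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(4.0, 1.0, size=(200, 2))])
n_comp, dim = 2, X.shape[1]
x0 = np.concatenate([np.zeros(n_comp),                              # equal weights
                     X[rng.choice(len(X), n_comp, replace=False)].ravel(),  # mean starts
                     np.zeros(n_comp * dim)])                        # unit variances
res = minimize(neg_log_lik, x0, args=(X, n_comp), method="L-BFGS-B")
alphas_hat, means_hat, vars_hat = unpack(res.x, n_comp, dim)
print(alphas_hat)
print(means_hat)
```

The softmax/log-variance reparameterization is only there to keep the optimizer unconstrained; with full covariance matrices I'd parameterize through something like a Cholesky factor instead.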
The mixing weights $\alpha_i$ seem like they would represent the marginal probability of each group, i.e. $\alpha_i = P(\text{group}_i)$. But then would the $i$th Gaussian be something like the probability of group $i$ given the data, i.e. $P(\text{group}_1 \mid x), P(\text{group}_2 \mid x)$, etc., which when summed would act like the normalizing constant for $P(x \mid \text{group}_1), P(x \mid \text{group}_2)$, etc.? Or is the output of the $i$th Gaussian $P(x \mid \text{group}_i)$ (which would make more sense)? If the latter, then since $P(\text{group}_i \mid x)=\frac{P(x \mid \text{group}_i)\,P(\text{group}_i)}{P(x)}$ is what the EM algorithm outputs (the responsibilities), I could seemingly back-calculate very easily what EM would provide.
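Spelling out the back-calculation I have in mind: if the $i$th Gaussian returns $P(x \mid \text{group}_i)$ and $\alpha_i = P(\text{group}_i)$, then Bayes' rule would give

$$P(\text{group}_i \mid x) \;=\; \frac{\alpha_i\, N(x;\theta_i)}{\sum_{k=1}^{n} \alpha_k\, N(x;\theta_k)},$$

with the denominator being exactly the fitted mixture density $P(x)$.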
Any ideas on whether I'm viewing this correctly? It seems that if I just want to fit a flexible PDF to my data, my method would work, correct? I'm still trying to reconcile the difference with applying the EM algorithm to the same type of problem, even if the method I describe is legitimate...