I was told that fitting Gaussian mixture models with gradient methods may end up collapsing onto Dirac delta function(s). I hadn't thought about this problem before, but when I tried to verify it, it does seem to be real.
For example, consider a mixture of 2 Gaussians and data points $x_1, x_2, \cdots, x_m$ ($m\gg 2$). The following model gives a likelihood of infinity:
- One component $c_1$ fits a single data point, say $x_1$, with a Dirac delta function (a Gaussian whose variance shrinks to zero).
- The other component $c_2$ fits the remaining data points with a broad Gaussian.
The likelihood is
$$\begin{align} p(\mathcal{D})&=\prod_{i=1}^mp(x_i)\\ &=\prod_{i=1}^m\bigg[p(c_i=1)p(x_i|c_i=1)+p(c_i=2)p(x_i|c_i=2)\bigg] \end{align} $$
Then for $x_1$, the probability density is infinite. For $x_2,\cdots,x_m$, the first term is zero but the second term is non-zero. Hence the overall likelihood is infinite.
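As a quick numerical sanity check (my own sketch; the data, weights, and parameter choices below are arbitrary), pin the first component's mean to $x_1$ and shrink its standard deviation: the log-likelihood grows without bound as $\sigma_1 \to 0$, because the $x_1$ term contributes $\log\frac{1}{\sigma_1\sqrt{2\pi}}$ while every other point stays covered by the broad component.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)          # m = 50 data points

w = 0.5                                    # equal mixing weights (my choice)
mu1 = x[0]                                 # component 1 pinned to x_1
mu2, sigma2 = x.mean(), x.std()            # component 2: one broad Gaussian

def log_likelihood(sigma1):
    # Mixture density at every point, then sum of logs
    dens = w * norm.pdf(x, mu1, sigma1) + w * norm.pdf(x, mu2, sigma2)
    return np.log(dens).sum()

for sigma1 in [1e-3, 1e-6, 1e-9, 1e-12]:
    print(f"sigma1 = {sigma1:.0e}   log-likelihood = {log_likelihood(sigma1):.1f}")
```

Every printed value is finite, but each factor-of-1000 shrink of $\sigma_1$ adds roughly $\log 1000 \approx 6.9$ to the total, so the supremum is indeed $+\infty$.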
I am wondering if my understanding is correct. If it is, I am confused about why EM doesn't seem to encounter this problem, since fitting a GMM with Dirac delta functions is not typically discussed in textbooks.
I am further puzzled by the objective of fitting GMMs. It seems that we shouldn't (and it isn't right to) maximize the likelihood: as shown above, the supremum of the likelihood is infinity, so there is nothing to maximize; it is already attained in the limit. Yet EM tries to maximize the likelihood by alternately making the lower bound on the likelihood tight and then optimizing that lower bound. This raises the doubt that EM works at all only because it cannot find the global optimum; otherwise it would fit a Dirac delta.
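To probe the EM side of the question, here is a minimal hand-rolled EM for a 2-component 1-D GMM (my own sketch; the data, initialization, and iteration count are arbitrary choices, not from any reference). From a generic initialization it settles at a finite local maximum rather than driving a variance to zero, which is consistent with EM only finding local optima; practical implementations additionally guard against the singularity with a variance floor (e.g., scikit-learn's `reg_covar` parameter on `GaussianMixture`).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Two well-separated clusters of 100 points each
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

def em_gmm(x, mu, sigma, pi, n_iter=200):
    """Plain EM for a 2-component 1-D GMM (no variance floor)."""
    for _ in range(n_iter):
        # E-step: responsibilities r[k, i] = p(component k | x_i)
        d = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)])
        r = d / d.sum(axis=0)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=1)
        pi = nk / len(x)
        mu = (r * x).sum(axis=1) / nk
        sigma = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return mu, sigma, pi

# A generic initialization converges to a sensible finite local optimum
mu, sigma, pi = em_gmm(x, mu=np.array([-1.0, 1.0]),
                       sigma=np.array([1.0, 1.0]), pi=np.array([0.5, 0.5]))
print(mu, sigma)  # roughly (-2, 2) and (1, 1)
```

Note that nothing in the M-step forbids a variance from collapsing: if one component were initialized with its mean exactly on a data point and a tiny variance, its responsibility would concentrate on that point and the variance update would shrink it further, heading toward the singularity.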
I am quite confused and not sure what's wrong.