I was told that fitting Gaussian mixture models with gradient methods may end up collapsing onto Dirac delta function(s). I hadn't thought about this problem before, but when I tried to verify it, it does seem to be real.
For example, consider a mixture of 2 Gaussians and data points $x_1, x_2, \cdots, x_m$ ($m\gg 2$). The following model gives a likelihood of infinity:
- One component $c_1$ fits a single data point, say $x_1$, with a Dirac delta function (a Gaussian whose variance shrinks to zero).
- The other component $c_2$ fits the remaining data points with a broad Gaussian.
The likelihood is
$$\begin{align} p(\mathcal{D})&=\prod_{i=1}^mp(x_i)\\ &=\prod_{i=1}^m\bigg[p(c_i=1)p(x_i|c_i=1)+p(c_i=2)p(x_i|c_i=2)\bigg] \end{align} $$
Then for $x_1$, the probability density is infinite. For $x_2,\cdots,x_m$, the first term is zero but the second term is non-zero. Hence the overall likelihood is infinite.
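As a quick numerical sanity check (my own sketch; the data, weights, and parameter choices below are arbitrary), pin the first component's mean to $x_1$ and shrink its standard deviation: the log-likelihood grows without bound as $\sigma_1 \to 0$, because the $x_1$ term contributes $\log\frac{1}{\sigma_1\sqrt{2\pi}}$ while every other point stays covered by the broad component.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=50)          # m = 50 data points

w = 0.5                                    # equal mixing weights (my choice)
mu1 = x[0]                                 # component 1 pinned to x_1
mu2, sigma2 = x.mean(), x.std()            # component 2: one broad Gaussian

def log_likelihood(sigma1):
    # Mixture density at every point, then sum of logs
    dens = w * norm.pdf(x, mu1, sigma1) + w * norm.pdf(x, mu2, sigma2)
    return np.log(dens).sum()

for sigma1 in [1e-3, 1e-6, 1e-9, 1e-12]:
    print(f"sigma1 = {sigma1:.0e}   log-likelihood = {log_likelihood(sigma1):.1f}")
```

Every printed value is finite, but each factor-of-1000 shrink of $\sigma_1$ adds roughly $\log 1000 \approx 6.9$ to the total, so the supremum is indeed $+\infty$.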
I am wondering if my understanding is correct. If it is, I am confused about why EM doesn't seem to encounter this problem, since fitting a GMM with Dirac delta functions is not typically discussed in textbooks.
I am further puzzled by the objective of fitting GMMs. It seems that we shouldn't (and it isn't right to) maximize the likelihood: as shown above, the supremum of the likelihood is infinity, so there is nothing to maximize; it is already attained in the limit. Yet EM tries to maximize the likelihood by alternately making the lower bound on the likelihood tight and then optimizing that lower bound. This raises the doubt that EM works at all only because it cannot find the global optimum; otherwise it would fit a Dirac delta.
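To probe the EM side of the question, here is a minimal hand-rolled EM for a 2-component 1-D GMM (my own sketch; the data, initialization, and iteration count are arbitrary choices, not from any reference). From a generic initialization it settles at a finite local maximum rather than driving a variance to zero, which is consistent with EM only finding local optima; practical implementations additionally guard against the singularity with a variance floor (e.g., scikit-learn's `reg_covar` parameter on `GaussianMixture`).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
# Two well-separated clusters of 100 points each
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])

def em_gmm(x, mu, sigma, pi, n_iter=200):
    """Plain EM for a 2-component 1-D GMM (no variance floor)."""
    for _ in range(n_iter):
        # E-step: responsibilities r[k, i] = p(component k | x_i)
        d = np.stack([p * norm.pdf(x, m, s) for p, m, s in zip(pi, mu, sigma)])
        r = d / d.sum(axis=0)
        # M-step: weighted maximum-likelihood updates
        nk = r.sum(axis=1)
        pi = nk / len(x)
        mu = (r * x).sum(axis=1) / nk
        sigma = np.sqrt((r * (x - mu[:, None]) ** 2).sum(axis=1) / nk)
    return mu, sigma, pi

# A generic initialization converges to a sensible finite local optimum
mu, sigma, pi = em_gmm(x, mu=np.array([-1.0, 1.0]),
                       sigma=np.array([1.0, 1.0]), pi=np.array([0.5, 0.5]))
print(mu, sigma)  # roughly (-2, 2) and (1, 1)
```

Note that nothing in the M-step forbids a variance from collapsing: if one component were initialized with its mean exactly on a data point and a tiny variance, its responsibility would concentrate on that point and the variance update would shrink it further, heading toward the singularity.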
I am quite confused and not sure what's wrong.