
Consider the log likelihood of a mixture of Gaussians:

$$l(S_n; \theta) = \sum^n_{t=1}\log f(x^{(t)}|\theta) = \sum^n_{t=1}\log\left\{\sum^k_{i=1}p_i f(x^{(t)}|\mu^{(i)}, \sigma^2_i)\right\}$$

I was wondering why it is computationally hard to maximize that equation directly. I am looking for either a clear, solid intuition for why it should be obvious that it's hard, or a more rigorous explanation of why it's hard. Is this problem NP-complete, or do we just not know how to solve it yet? Is this the reason we resort to using the EM (expectation-maximization) algorithm? (A small sketch of what I mean by direct maximization follows the notation below.)


Notation:

$S_n$ = training data.

$x^{(t)}$ = data point.

$\theta$ = the set of parameters specifying the Gaussians: their means, standard deviations, and the probability of generating a point from each cluster/class/Gaussian.

$p_i$ = the probability of generating a point from cluster/class/Gaussian i.
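
For concreteness, here is a small sketch of what I mean by maximizing the log-likelihood directly: hand $-l(S_n;\theta)$ to a generic optimizer. This is only an illustration; the toy data, the unconstrained reparametrization of $p_i$ and $\sigma_i$, and the starting point are arbitrary choices of mine.

```python
# Sketch: feed the negative GMM log-likelihood to a generic optimizer.
# Toy 1-D data with k = 2 components; all specific values are placeholders.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])  # S_n

def neg_log_lik(theta, x):
    """-l(S_n; theta) for a 2-component 1-D Gaussian mixture.

    theta packs unnormalized log-weights, means and log-standard deviations,
    so that the optimizer can work on unconstrained reals.
    """
    logits, mu, log_sigma = np.split(theta, 3)
    log_p = logits - np.logaddexp.reduce(logits)        # log p_i, p sums to 1
    sigma = np.exp(log_sigma)                           # sigma_i > 0
    # log f(x^(t)|theta) = logsumexp_i [log p_i + log N(x^(t)|mu_i, sigma_i^2)]
    comp = log_p + norm.logpdf(x[:, None], mu, sigma)   # shape (n, k)
    return -np.logaddexp.reduce(comp, axis=1).sum()

theta0 = np.array([0.0, 0.0, -1.0, 1.0, 0.0, 0.0])      # arbitrary start
res = minimize(neg_log_lik, theta0, args=(x,), method="BFGS")
print(res.fun, res.x)   # different starts can end in different local optima
```

The reparametrization (softmax weights, log-standard deviations) is only there to keep the optimizer unconstrained; it is not part of the question itself.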

Charlie Parker

2 Answers


First, a GMM is a particular model used for clustering, where you try to find the optimal labelling of your $n$ observations. With $k$ possible classes, there are $k^n$ possible labellings of your training data, which already becomes huge for moderate values of $k$ and $n$.

Second, the function you are trying to minimize (the negative log-likelihood) is not convex, which, together with the size of the problem, makes it very hard. I only know that k-means (a GMM can be seen as a soft version of k-means) is NP-hard; I am not aware of whether this has been proved for GMMs as well.

To see that the problem is not convex, consider the one-dimensional case: $$ L = \log \left(e^{-({x}/{\sigma_{1}})^2} + e^{-({x}/{\sigma_{2}})^2}\right) $$ and check that the sign of $\frac{d^2L}{dx^2}$ is not the same for all $x$, so $L$ is neither convex nor concave.
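
If it helps, here is a quick numerical check of that statement (my own sketch; the values of $\sigma_1$ and $\sigma_2$ are arbitrary example values):

```python
# Finite-difference check that d^2L/dx^2 changes sign for
# L(x) = log( exp(-(x/sigma1)^2) + exp(-(x/sigma2)^2) ).
import numpy as np

sigma1, sigma2 = 1.0, 0.1          # arbitrary example values

def L(x):
    return np.log(np.exp(-(x / sigma1) ** 2) + np.exp(-(x / sigma2) ** 2))

h = 1e-3
for x in (0.0, 0.2, 1.0):
    d2 = (L(x - h) - 2 * L(x) + L(x + h)) / h ** 2   # central second difference
    print(f"x = {x:3.1f}   d2L/dx2 = {d2:9.2f}")
# Negative curvature at x = 0 and x = 1, positive curvature around x = 0.2.
```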

Having a non-convex problem means that you can get stuck in local optima. In general, you do not have the strong guarantees you have in convex optimization, and searching for a solution is also much harder.
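
You can also see this empirically. Here is a sketch using scikit-learn's EM-based `GaussianMixture` (the data and settings are arbitrary): random initialisations of EM can converge to different solutions with different log-likelihood bounds.

```python
# Fit the same GMM several times from random initialisations and compare the
# converged log-likelihood bounds: if they differ, EM found different optima.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-4, 1, 200),
                    rng.normal(0, 1, 200),
                    rng.normal(5, 2, 200)]).reshape(-1, 1)

for seed in range(5):
    gm = GaussianMixture(n_components=3, init_params="random",
                         n_init=1, max_iter=500, random_state=seed).fit(X)
    print(f"seed {seed}:  lower_bound_ = {gm.lower_bound_:.4f}  "
          f"means = {np.sort(gm.means_.ravel()).round(2)}")
# When the bounds differ across seeds, EM has stopped in different local optima.
```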

jpmuc
  • Regarding the second point: k-means can be viewed as a special case of GMMs (more precisely, a limit case where the variances are taken to zero). If we can reduce k-means to the fitting of a GMM, the latter must be an NP-hard problem as well. – Lucas Dec 20 '14 at 17:31
  • @Lucas: Here is a [Cross Validated link](http://stats.stackexchange.com/questions/117874/k-means-as-a-limit-case-of-em-algorithm-for-gaussian-mixtures-with-covariances?rq=1) to your remark. – Xi'an Dec 20 '14 at 18:14

In addition to juampa's points, let me point out these difficulties:

  • The function $l(\theta|S_n)$ is unbounded, so the true maximum is $+\infty$: it corresponds, for instance, to $\hat\mu^{(i)}=x^{(1)}$ and $\hat\sigma_i=0$. A true maximiser should therefore end up with this solution, which is useless for estimation purposes (see the numerical sketch after this list).
  • Even without considering the $k^n$ terms in the decomposition of the product of sums as a sum of products in $l(\theta|S_n)$, the function to be maximised in $\theta$ is highly multi-modal (in addition to being non-convex), hence a challenge for numerical methods. EM acknowledges this difficulty by converging to a local mode or saddle point, and by requiring multiple runs, as shown in the image below (taken from my book).

[Image: the multiple modes of a mixture log-likelihood.]
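
Here is a small numerical sketch of the first point (toy data; the second component's parameters and the grid of $\sigma_1$ values are arbitrary): placing $\hat\mu^{(1)}$ on $x^{(1)}$ and letting $\hat\sigma_1$ shrink makes $l(\theta|S_n)$ grow without bound.

```python
# Degenerate solutions: mu_1 = x^(1) and sigma_1 -> 0 send the log-likelihood
# to +infinity, even though this tells us nothing useful about the data.
import numpy as np
from scipy.stats import norm

x = np.linspace(-3.0, 3.0, 25)              # any data set exhibits this
p = np.array([0.5, 0.5])                    # fixed mixture weights
mu = np.array([x[0], 0.0])                  # mu_1 = x^(1), mu_2 arbitrary

for sigma1 in (1e-2, 1e-4, 1e-8, 1e-16):
    sigma = np.array([sigma1, 2.0])
    comp = np.log(p) + norm.logpdf(x[:, None], mu, sigma)   # shape (n, k)
    log_lik = np.logaddexp.reduce(comp, axis=1).sum()       # l(theta | S_n)
    print(f"sigma_1 = {sigma1:.0e}   l(theta|S_n) = {log_lik:8.2f}")
# l increases like log(1/sigma_1): the supremum is +infinity, attained only in
# the useless degenerate limit.
```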

An additional remark: without calling the EM algorithm, one may use a standard optimisation algorithm (like Newton-Raphson) one parameter at a time, that is, iterate

  • find $\theta_1^\prime=\arg\max_{\theta_1} l(\theta|S_n)$
  • find $\theta_2^\prime=\arg\max_{\theta_2} l(\theta_1^\prime,\theta_{-1}|S_n)$
  • ...
  • find $\theta_v^\prime=\arg\max_{\theta_v} l(\theta_{-v}^\prime,\theta_v|S_n)$

where $v$ is the total number of parameters. Each step increases the value of the target function $l(\theta|S_n)$, but this scheme will at best end up in the same local mode as the EM algorithm.
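
For illustration, here is a rough sketch of such a coordinate-wise scheme (a generic one-dimensional optimiser stands in for Newton-Raphson; the data, parametrisation, starting point and number of sweeps are all arbitrary choices):

```python
# Coordinate-wise maximisation of the mixture log-likelihood: optimise one
# parameter at a time, sweeping over all parameters repeatedly.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 100), rng.normal(3, 0.5, 100)])

def log_lik(theta, x):
    """l(theta|S_n) for k = 2; theta = (logit p_1, mu_1, mu_2, log s_1, log s_2)."""
    p1 = 1.0 / (1.0 + np.exp(-theta[0]))
    p = np.array([p1, 1.0 - p1])
    mu, sigma = theta[1:3], np.exp(theta[3:5])
    comp = np.log(p) + norm.logpdf(x[:, None], mu, sigma)
    return np.logaddexp.reduce(comp, axis=1).sum()

theta = np.array([0.0, -1.0, 1.0, 0.0, 0.0])           # arbitrary start
for sweep in range(20):                                 # fixed number of sweeps
    for j in range(len(theta)):                         # one parameter at a time
        def neg_l_j(t, j=j):
            th = theta.copy()
            th[j] = t
            return -log_lik(th, x)
        res = minimize_scalar(neg_l_j, bracket=(theta[j] - 1.0, theta[j] + 1.0))
        if -res.fun > log_lik(theta, x):                # accept only improvements
            theta[j] = res.x

print(theta, log_lik(theta, x))     # ends in a local mode, just as EM would
```

As in the text, each accepted one-dimensional step increases $l(\theta|S_n)$, so the scheme converges, but only to whichever local mode the starting point leads to.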

Xi'an
  • OK, $L$ is unbounded if a variance is 0. But if we exclude those values from the possible parameters (so we assume all variances > 0), then $L$ should not blow up for an infinitesimally small chosen variance (because of the other points). Am I right? Then, for this restricted set of parameters, $L$ would be bounded, and this would imply that the EM algorithm converges (as an increasing bounded sequence). – ahstat May 06 '18 at 03:52
  • @ahstat: assuming the variances are strictly positive does not prevent EM from converging to a degenerate solution if it is started close enough to one. – Xi'an May 06 '18 at 07:46