The EM algorithm is only guaranteed to find a local rather than a global maximum of the likelihood. As a consequence, the results depend on the initial conditions (e.g., when the initial cluster centers are chosen randomly). Some tutorials that I have seen mention running the EM algorithm several times in such cases (a minimal restart loop is sketched after the list below). This, however, raises several practical questions:
- How many times to run the algorithm?
- How to choose the most reliable solution?
- How to represent the results if multiple solutions are reasonable?
- How to estimate the optimal number of clusters, given multiple solutions?
- When to give up altogether on clustering / mixture-modeling for the given dataset?
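For concreteness, here is a minimal sketch of the multiple-restart strategy using scikit-learn's `GaussianMixture` (the synthetic data, `n_components=3`, and the 20 restarts are illustrative placeholders, not recommendations):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # placeholder for the real compositional data

n_runs = 20
fits = []
for seed in range(n_runs):
    gm = GaussianMixture(n_components=3, init_params="random",
                         random_state=seed, max_iter=200).fit(X)
    fits.append((gm.lower_bound_, gm))  # log-likelihood lower bound per run

# Keep the run with the highest log-likelihood; the spread of the
# per-run values hints at how multimodal the likelihood surface is.
fits.sort(key=lambda t: t[0], reverse=True)
best_loglik, best_model = fits[0]
print([round(ll, 2) for ll, _ in fits])
```

Note that `GaussianMixture`'s `n_init` parameter automates exactly this loop internally; the explicit version is shown only to expose the per-run log-likelihoods.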
Background: I am clustering data in the context of enterotyping, i.e., assigning microbial compositions to several typical compositional types.
Update: Gelman-Rubin criterion
I wonder whether anything exists in the spirit of the Gelman-Rubin criterion used in a similar setting: when a mixture model is analyzed with Markov chain Monte Carlo techniques. The criterion compares the variance among MCMC chains launched with different initial conditions to the variance within each chain; their ratio should converge to 1 if the number of MC steps is sufficient.
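For reference, the (non-split) Gelman-Rubin statistic is straightforward to compute from a set of traces; a sketch, assuming each run records some scalar quantity (e.g., the log-likelihood) at every step:

```python
import numpy as np

def gelman_rubin(chains: np.ndarray) -> float:
    """Potential scale reduction factor for an (m, n) array of m chains
    of length n; values close to 1 suggest the runs agree."""
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # mean within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return np.sqrt(var_hat / W)
```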
Update 2: MC-EM and SAEM
Another approach, which is partially an alternative to multiple re-runs of EM, is Monte Carlo Expectation Maximization (MC-EM); SAEM (stochastic approximation EM) is a closely related variant. See Rationale for MCEM and Avoiding burn-ins in MC-EM for more discussion, as well as A New Class of Stochastic EM Algorithms. Escaping Local Maxima and Handling Intractable Sampling.
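To make the MC-EM idea concrete, here is a toy sketch for a two-component 1-D Gaussian mixture with known unit variances, where the exact E-step responsibilities are replaced by an average over sampled latent labels (the model, `S=50` draws per iteration, and the initialization are all illustrative):

```python
import numpy as np
from scipy.stats import norm

def mcem_two_gaussians(x, n_iter=100, S=50, seed=0):
    rng = np.random.default_rng(seed)
    pi, mu = 0.5, np.array([x.min(), x.max()])   # crude initialization
    for _ in range(n_iter):
        # E-step (Monte Carlo): sample latent labels from p(z | x, theta)
        p1 = pi * norm.pdf(x, mu[1], 1.0)
        p0 = (1 - pi) * norm.pdf(x, mu[0], 1.0)
        resp = p1 / (p0 + p1)
        z = rng.random((S, x.size)) < resp       # S sampled label vectors
        w = z.mean(axis=0)                       # MC estimate of responsibilities
        # M-step: closed-form updates using the sampled weights
        pi = w.mean()
        mu = np.array([np.sum((1 - w) * x) / np.sum(1 - w),
                       np.sum(w * x) / np.sum(w)])
    return pi, mu

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(2, 1, 200)])
print(mcem_two_gaussians(x))  # should recover roughly pi ~ 0.4, means ~ (-2, 2)
```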