
Can someone explain to me how to interpret the log-likelihood measure when evaluating clustering techniques?

Let's say I am using a Gaussian mixture model with Expectation-Maximization, and I want to choose the best number of clusters. Each clustering model outputs a log-likelihood, but which one is best: a smaller one or a bigger one? Weka, for example, even outputs negative values.

Can someone explain this to me? I have been searching this topic for about two weeks and haven't found an answer. Even though I have some knowledge of statistics, statistical inference is not my cup of tea.

Carlos Costa
  • Have you considered searching our site for [threads about "likelihood"](http://stats.stackexchange.com/search?tab=votes&q=likelihood)? – whuber Feb 23 '16 at 23:06
  • Yes, before this post I searched Cross Validated for this, but I didn't find an answer in the clustering context. – Carlos Costa Feb 24 '16 at 22:16
  • Could you tell us how the likelihood in a clustering context might have different properties from any other likelihood? That might help indicate what aspects of this question people should focus on. – whuber Feb 24 '16 at 22:19
  • I am trying to better understand the log-likelihood measure. For example: is a log-likelihood of -22 for EM clustering with 4 clusters better than a log-likelihood of -24 for 6 clusters? At the moment, I don't know how to interpret this value, or even its range. Sometimes WEKA gives me a positive log-likelihood, which is very strange given that the log of anything in $[0,1]$ should be negative. – Carlos Costa Feb 25 '16 at 23:17
  • Likelihoods for families of *continuous* distributions are actually products of probability *densities* rather than probabilities themselves. [Densities can be arbitrarily large, far exceeding $1$.](http://stats.stackexchange.com/questions/4220) That can make log-likelihoods positive. For interpretation, you might find the best information by searching our site for [tag:AIC]. – whuber Feb 26 '16 at 00:22

2 Answers


The likelihood is very similar to a probability. Here, it is the probability of each observation given its assigned cluster label.

If you take the log of this, negative values naturally arise, because probabilities lie in $[0,1]$ and their logarithms are therefore at most $0$. (For continuous models the likelihood is actually a density, which can exceed $1$; that is why positive log-likelihoods also occur.)
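A minimal sketch can make this concrete; it assumes scikit-learn's `GaussianMixture` in place of Weka's EM, with invented data, purely for illustration. The total log-likelihood is a sum of per-point log densities, so it comes out negative when those densities are below $1$ and positive when they exceed $1$:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Spread-out data: densities well below 1, so the total log-likelihood is negative.
X_wide = rng.normal(0.0, 10.0, size=(200, 1))
# Tightly concentrated data: densities far above 1, so it comes out positive.
X_tight = rng.normal(0.0, 0.01, size=(200, 1))

for name, X in [("wide", X_wide), ("tight", X_tight)]:
    gm = GaussianMixture(n_components=2, random_state=0).fit(X)
    total_ll = gm.score_samples(X).sum()  # sum of per-point log densities
    print(f"{name}: total log-likelihood = {total_ll:.1f}")
```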

Has QUIT--Anony-Mousse
  • Thank you for the answer; it is clearer to me now. I still have doubts about evaluating this metric for clustering, but I will continue to search. Thank you. – Carlos Costa Feb 24 '16 at 22:20

The log-likelihood depends on the probability model(s) you consider for the observed data, as well as on the data themselves.

If the likelihood of the sample is greater under one model than under another, we tend to infer that the former model is more likely than the latter. Whilst not a probability per se (in fact, it is a probability density), the likelihood can rank two probability models in this fashion, even for a single observation. The log-likelihood is simply the log of the likelihood. If a likelihood is less than 1, the log-likelihood is negative, but this can arise from noisy data, sparse data, small sample sizes, and a host of other causes. We cannot objectively say anything from a single likelihood or log-likelihood; it is strictly relative, and only compares models.
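To make that relative nature concrete, here is a minimal sketch; the two candidate models and the data are invented for illustration. The same sample is scored under two Gaussian models, and only the comparison of the two log-likelihoods is informative, not either value on its own:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=100)  # data actually drawn from N(5, 1)

# Candidate model A: N(5, 1); candidate model B: N(0, 1).
ll_a = norm.logpdf(x, loc=5.0, scale=1.0).sum()
ll_b = norm.logpdf(x, loc=0.0, scale=1.0).sum()

# Both values are typically negative; only their difference matters.
print(f"log L under A = {ll_a:.1f}, under B = {ll_b:.1f}")
print("prefer A" if ll_a > ll_b else "prefer B")
```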

One frequently used model for clustering is a Gaussian density, which you describe. It gives a probability law for how far an observation will fall from its "centroid", or mean. The model with the optimal log-likelihood is the saturated model, in which each of the $n$ observations forms its own cluster with the observed value as its centroid, and the standard deviation(s) become irrelevant.

Log-likelihoods are used frequently in statistical inference, but only to infer whether one probability model fits the observed data better than another. That is a confirmatory comparison, not an exploratory one. Log-likelihoods do not determine the total number of clusters, which is an exploratory question rather than a comparison of pre-specified models. Used that way, the likelihood tends to overfit, because maximum likelihood has some high-dimensional problems.
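A short sketch of that overfitting tendency, again assuming scikit-learn's `GaussianMixture` as the EM implementation and invented three-cluster data: the maximized log-likelihood keeps creeping upward as components are added, even beyond the true number of clusters, so it cannot choose the number of clusters on its own.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from 3 well-separated clusters.
X = np.vstack([rng.normal(mu, 1.0, size=(100, 2)) for mu in (0.0, 5.0, 10.0)])

for k in range(1, 8):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # score() is the average per-sample log-likelihood; it (almost always)
    # increases with k, because extra components can only fit the sample better.
    print(f"k={k}: average log-likelihood = {gm.score(X):.3f}")
```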

If you have the log-likelihood, however, you can convert that value to a Bayesian Information Criterion (BIC). This encourages sparse models by penalizing the total number of parameters in the model.
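A sketch of that conversion, under the same assumptions as the example above: scikit-learn's `bic()` computes $\mathrm{BIC} = p \ln n - 2 \ln \hat{L}$, where $p$ is the number of free parameters, and the candidate with the smallest BIC is preferred.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(mu, 1.0, size=(100, 2)) for mu in (0.0, 5.0, 10.0)])

# Lower BIC is better: the parameter penalty offsets the
# ever-increasing log-likelihood seen above.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print(f"BIC prefers k = {best_k}")  # typically recovers the 3 true clusters
```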

AdamO