
I'm having trouble understanding some of the formulas related to the BIC calculation in this paper (Dan Pelleg and Andrew Moore, *X-means: Extending K-means with Efficient Estimation of the Number of Clusters*).

First, the variance equation:

  • R - number of points
  • K - number of clusters
  • $\mu_{(i)}$ - the centroid of the cluster that the $i$th point belongs to
  • $\sigma^2 = \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 $

The log likelihood then uses this sigma. Am I reading this right: are they using one variance (a single spherical covariance) for all clusters (see the quote below; they are)? This makes no sense to me. If you have 5 clusters, each one is a Gaussian under the k-means model, so wouldn't it make sense to compute a variance $\sigma^2_j$ for each cluster and use that?
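For concreteness, here's a minimal sketch (NumPy; the function and variable names are my own, not the paper's) of the single-variance model as I'm reading it: one pooled $\sigma^2$ shared by every cluster's Gaussian, with mixture weights given by cluster sizes:

```python
import numpy as np

def pooled_variance(X, labels, centroids):
    """sigma^2 = 1/(R - K) * sum_i (x_i - mu_(i))^2: the single
    variance estimate from the question, summed over all points
    (and, when M > 1, over all coordinates)."""
    R, K = len(X), len(centroids)
    mu_of_i = centroids[labels]  # mu_(i): centroid of point i's cluster
    return np.sum((X - mu_of_i) ** 2) / (R - K)

def log_likelihood(X, labels, centroids):
    """Log-likelihood under the model described above: a mixture of
    spherical Gaussians that all share the pooled sigma^2, with
    mixture weight R_n / R for a cluster holding R_n of the R points."""
    R, M = X.shape
    K = len(centroids)
    sigma2 = pooled_variance(X, labels, centroids)
    R_n = np.bincount(labels, minlength=K)  # points per cluster
    sq_dist = np.sum((X - centroids[labels]) ** 2, axis=1)
    log_p = (np.log(R_n[labels] / R)
             - 0.5 * M * np.log(2 * np.pi * sigma2)
             - sq_dist / (2 * sigma2))
    return log_p.sum()
```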

My second question is regarding number of parameters to use in the BIC score. The paper mentions

Number of free parameters $p_j$ is simply the sum of K-1 class probabilities, M*K centroid coordinates, and one variance estimate.

How do you get the $K-1$ class probabilities? I could use (number of points in cluster $j$) / (total number of points), but that gives $K$ probabilities, not $K-1$. Which probability is left out of the sum?
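For reference, here's a sketch of the score itself, reading $p_j$ as a parameter count (the comments below confirm this reading); `log_likelihood` is the helper sketched above, and the naming is again my own:

```python
def bic_score(X, labels, centroids):
    """Schwarz criterion: loglik - p_j/2 * log(R), where
    p_j = (K - 1) + M*K + 1 counts the K-1 free class probabilities
    (they sum to 1, so the K-th is determined), the M*K centroid
    coordinates, and the one shared variance estimate."""
    R, M = X.shape
    K = len(centroids)
    p_j = (K - 1) + M * K + 1
    return log_likelihood(X, labels, centroids) - p_j / 2 * np.log(R)
```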

P.S. If anyone has a nicer paper on estimating $k$ by similar methods, I'd like to read that as well. At this point I'm not too concerned with speed.

Thanks for your help.

  • Your first question will be answered if you go back and transcribe the variance formula correctly and then refer to the definition of $\mu_{(i)}$ at the end of section 2. The second question is hard to understand: $p_j$ is a *count* equal to $K-1 + M*K + 1$. There's nothing left to estimate! – whuber Jul 15 '11 at 15:43
  • First question: I still don't get it. Was my mistake the subscript on $x_i$? Second question: thank you, that was my reading comprehension. I thought it was a sum of probabilities; instead, as you point out, it's the sum of the count $K-1$ and the other terms. – Budric Jul 15 '11 at 15:51

1 Answer


Let the clusters be indexed by $j = 1, \ldots, K$ with $K_j \gt 0$ points in cluster $j$. Let $\mu_j$ (no parentheses around the subscript) designate the mean of cluster $j$. Then, because by definition $\mu_{(i)}$ is the mean of whichever cluster $x_i$ belongs to, we can group the terms in the summation by cluster:

$$\begin{aligned} \sigma^2 &= \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K\sum_{k=1}^{K_j}(x_k - \mu_j)^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j \left(\frac{1}{K_j}\sum_{k=1}^{K_j}(x_k - \mu_j)^2\right) \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j\, \sigma_j^2, \end{aligned}$$

with $\sigma_j^2$ being the variance within cluster $j$ (where we must use $K_j$ instead of $K_j-1$ in the denominators to handle singleton clusters). I believe this is what you were expecting.
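A quick numeric check of the grouping identity, with made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)               # toy 1-D data
labels = rng.integers(0, 3, size=100)  # arbitrary assignment to K = 3 clusters
R, K = len(X), 3
mu = np.array([X[labels == j].mean() for j in range(K)])

# Pooled form: squared deviations of every point from its own cluster mean
lhs = np.sum((X - mu[labels]) ** 2) / (R - K)

# Grouped form: cluster-size-weighted within-cluster variances
# (denominator K_j, not K_j - 1, as noted above)
K_j = np.bincount(labels, minlength=K)
sigma2_j = np.array([np.mean((X[labels == j] - mu[j]) ** 2) for j in range(K)])
rhs = np.sum(K_j * sigma2_j) / (R - K)

assert np.isclose(lhs, rhs)            # the two forms agree
```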

  • Thanks, but no, this is not what I was expecting at all. I was expecting to compute a covariance matrix for each cluster $j$ (dimension $M > 1$) and to use $\mu_j$ together with covariance $j$ in its own multivariate Gaussian to compute each point's probability. And just as the means add M*K free parameters, the covariance matrices would add M*M*K more. – Budric Jul 15 '11 at 16:10
  • @Budric Well, that's not what they are doing! Obviously they are assuming a common covariance structure around each center. This proves your original interpretation is correct. But why does a common covariance "make no sense"? It's a good place to start. You can do *post hoc* tests to see whether it's a reasonable assumption. If it's not, then it sounds like you might need to generalize the X-means approach yourself :-). – whuber Jul 15 '11 at 16:23