
I'm having trouble understanding some of the formulas related to the BIC calculation in this paper (Dan Pelleg and Andrew Moore, *X-means: Extending K-means with Efficient Estimation of the Number of Clusters*).

First, the variance equation:

  • R - number of points
  • K - number of clusters
  • $\mu_{(i)}$ - the centroid of the cluster that the $i$th point belongs to
  • $\sigma^2 = \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 $

The log likelihood then uses this sigma. Am I reading this right: are they using one variance (a single spherical covariance) for all clusters (see the quote below; they are)? This makes no sense to me. If you have 5 clusters, each one is a Gaussian under the k-means model, so wouldn't it make sense to compute a variance $\sigma^2_j$ for each cluster and use that?
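For concreteness, here's a minimal sketch (NumPy; the function and variable names are my own, not the paper's) of the single-variance model as I'm reading it: one pooled $\sigma^2$ shared by every cluster's Gaussian, with mixture weights given by cluster sizes:

```python
import numpy as np

def pooled_variance(X, labels, centroids):
    """sigma^2 = 1/(R - K) * sum_i (x_i - mu_(i))^2: the single
    variance estimate from the question, summed over all points
    (and, when M > 1, over all coordinates)."""
    R, K = len(X), len(centroids)
    mu_of_i = centroids[labels]  # mu_(i): centroid of point i's cluster
    return np.sum((X - mu_of_i) ** 2) / (R - K)

def log_likelihood(X, labels, centroids):
    """Log-likelihood under the model described above: a mixture of
    spherical Gaussians that all share the pooled sigma^2, with
    mixture weight R_n / R for a cluster holding R_n of the R points."""
    R, M = X.shape
    K = len(centroids)
    sigma2 = pooled_variance(X, labels, centroids)
    R_n = np.bincount(labels, minlength=K)  # points per cluster
    sq_dist = np.sum((X - centroids[labels]) ** 2, axis=1)
    log_p = (np.log(R_n[labels] / R)
             - 0.5 * M * np.log(2 * np.pi * sigma2)
             - sq_dist / (2 * sigma2))
    return log_p.sum()
```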

My second question is regarding number of parameters to use in the BIC score. The paper mentions

Number of free parameters $p_j$ is simply the sum of K-1 class probabilities, M*K centroid coordinates, and one variance estimate.

How do you get the $K-1$ class probabilities? I could use (number of points in cluster $j$) / (total number of points), but that gives $K$ probabilities, not $K-1$. Which probability is left out of the sum?
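For reference, here's a sketch of the score itself, reading $p_j$ as a parameter count (the comments below confirm this reading); `log_likelihood` is the helper sketched above, and the naming is again my own:

```python
def bic_score(X, labels, centroids):
    """Schwarz criterion: loglik - p_j/2 * log(R), where
    p_j = (K - 1) + M*K + 1 counts the K-1 free class probabilities
    (they sum to 1, so the K-th is determined), the M*K centroid
    coordinates, and the one shared variance estimate."""
    R, M = X.shape
    K = len(centroids)
    p_j = (K - 1) + M * K + 1
    return log_likelihood(X, labels, centroids) - p_j / 2 * np.log(R)
```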

P.S. If anyone has a nicer paper on estimating $k$ by similar methods, I'd like to read that as well. At this point I'm not too concerned with speed.

Thanks for your help.

  • Your first question will be answered if you go back and transcribe the variance formula correctly and then refer to the definition of $\mu_{(i)}$ at the end of section 2. The second question is hard to understand: $p_j$ is a *count* equal to $K-1 + M*K + 1$. There's nothing left to estimate! – whuber Jul 15 '11 at 15:43
  • First question: I still don't get it. Was my mistake the subscript on $x_i$? Second question: thank you, that was my reading comprehension. I thought it was a sum of probabilities; instead, as you point out, it's the sum of the count $K-1$ and the other terms. – Budric Jul 15 '11 at 15:51

1 Answer


Let the clusters be indexed by $j = 1, \ldots, K$ with $K_j \gt 0$ points in cluster $j$. Let $\mu_j$ (no parentheses around the subscript) designate the mean of cluster $j$. Then, because by definition $\mu_{(i)}$ is the mean of whichever cluster $x_i$ belongs to, we can group the terms in the summation by cluster:

$$\begin{aligned} \sigma^2 &= \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K\sum_{k=1}^{K_j}(x_k - \mu_j)^2 \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j \left(\frac{1}{K_j}\sum_{k=1}^{K_j}(x_k - \mu_j)^2\right) \\ &= \frac{1}{R-K}\sum_{j=1}^K K_j\, \sigma_j^2, \end{aligned}$$

with $\sigma_j^2$ being the variance within cluster $j$ (where we must use $K_j$ instead of $K_j-1$ in the denominators to handle singleton clusters). I believe this is what you were expecting.
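A quick numeric check of the grouping identity, with made-up toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=100)               # toy 1-D data
labels = rng.integers(0, 3, size=100)  # arbitrary assignment to K = 3 clusters
R, K = len(X), 3
mu = np.array([X[labels == j].mean() for j in range(K)])

# Pooled form: squared deviations of every point from its own cluster mean
lhs = np.sum((X - mu[labels]) ** 2) / (R - K)

# Grouped form: cluster-size-weighted within-cluster variances
# (denominator K_j, not K_j - 1, as noted above)
K_j = np.bincount(labels, minlength=K)
sigma2_j = np.array([np.mean((X[labels == j] - mu[j]) ** 2) for j in range(K)])
rhs = np.sum(K_j * sigma2_j) / (R - K)

assert np.isclose(lhs, rhs)            # the two forms agree
```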

  • Thanks, but no, this is not what I was expecting at all. I was expecting to compute a covariance matrix for each cluster $j$ (dimension $M > 1$) and to use $\mu_j$ together with covariance $j$ in its own multivariate Gaussian to compute each point's probability. And just as the means add M*K free parameters, the covariance matrices would add M*M*K more. – Budric Jul 15 '11 at 16:10
  • @Budric Well, that's not what they are doing! Obviously they are assuming a common covariance structure around each center. This proves your original interpretation is correct. But why does a common covariance "make no sense"? It's a good place to start. You can do *post hoc* tests to see whether it's a reasonable assumption. If it's not, then it sounds like you might need to generalize the X-means approach yourself :-). – whuber Jul 15 '11 at 16:23