I'm having trouble understanding some of the formulas in this paper related to BIC calculation (Dan Pelleg and Andrew Moore, X-means: Extending K-means with Efficient Estimation of the Number of Clusters).
First the variance equation:
- R - number of points
- K - number of clusters
- $\mu_i$ - centroid associated with ith point.
- $\sigma^2 = \frac{1}{R-K}\sum_{i}(x_i - \mu_{(i)})^2 $
The log likelyhood then uses this sigma. Am I reading this right, they're using 1 covariance matrix for all clusters (see quote below, they are)? This makes no sense. If you have 5 clusters, each one is a Gaussian according to k-means algorithm. So wouldn't it make sense to compute covariance $\sigma^2_i$ for each cluster and use that?
My second question is regarding number of parameters to use in the BIC score. The paper mentions
Number of free parameters $p_j$ is simply the sum of K-1 class probabilities, M*K centroid coordinates, and one variance estimate.
How do you get the K-1 class probabilities? I could do # of points in class i / total number of points. But then it's K-1, which probability is left out of the sum?
P.S. If anyone has a nicer paper on estimating k using similar methods I'd like to read that as well. At this point I'm not too concerned with speed.
Thanks for your help.