0

I've programmed the kmeans++ algorithm in java, it's the regular kmeans but choosing the initial clusters in a smarter way rather than just random. I've made some tests and this seems to be correctly programmed.

But for my purposes I need the computer to decide the K value so another algorithm computes k-means++ with k from 2 to 7 and then choose the one with the maximum value of BIC:

int d = 1; //dimension
double variance = (double) (1.0 / (N - m)) * SSE(dataset,labels,centroids);
double penalty = 0.5 * m * Math.log(N) * (d + 1);
double likelihood = 0;
int i = 0;
while (i < m) {
    double ni = grupos[i].length;
    likelihood += (ni * Math.log(ni) - ni * Math.log(N) - (ni * 1 * 
    0.5) * Math.log(2 * Math.PI * var) - (ni - 1) * 1 * 0.5);
    i++;
}
return likelihood - penalty; //BIC

The problem is: sometimes (but not purely random) this method gives me different values of K each time I run.

I've tested this method with simple examples which i know the K and it returns the right value of K and the data is correctly clustered.

Is this suppose to vary sometimes with more complex data?

Bernardo Braga
  • 125
  • 1
  • 8
  • I will not be checking your code, but will note that traditionally BIC is computed as "lower is better" form, so why did you select the highest value? There is a number of questions/answers about using BIC for clustering, on the site. Search to read. – ttnphns Nov 06 '17 at 23:32
  • Xmeans is more than just kmeans for k=2..7 and BIC. – Has QUIT--Anony-Mousse Nov 07 '17 at 08:32
  • @ttnphns because i have bic_1=loglikelihood-penalty instead of bic_2=-loglikelihood+penalty. Maximizing bic_1 is the same as minimazing the tradicional bic_2 – Bernardo Braga Nov 07 '17 at 09:40
  • Compare your calculations with, for instance, https://stats.stackexchange.com/q/55147/3277 or other answers. – ttnphns Nov 07 '17 at 10:27

0 Answers0