-1

I would like to confirm the initialization steps of my K-means++ implementation (steps which chose initial centers of clusters). I am wondering if my initialization scheme has been implemented correctly. Here is my pseudocode:

   select one data item at random as the first mean
   loop k-1 times
        compute distance from each item to closest mean
        select an item that has large distance-squared as next initial mean
   end loop

2 Answers2

1

K-Mean++ is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.

The exact algorithm is as follows:

  1. Choose one center uniformly at random from among the data points.
  2. For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
  3. Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
  4. Repeat Steps 2 and 3 until k centers have been chosen.
  5. Now that the initial centers have been chosen, proceed using standard k-means clustering.

Besides this, you may find the following answers helpful too; 1, 2, 3 and 4

Reference: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

mnm
  • 205
  • 3
  • 8
0

K-means++ does not select the element with the largest distance.

It uses a weighted random to prefer farther elements.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96