Initialization of Kmeans++ clusters

Question

I would like to confirm the initialization steps of my K-means++ implementation (steps which chose initial centers of clusters). I am wondering if my initialization scheme has been implemented correctly. Here is my pseudocode:

   select one data item at random as the first mean
   loop k-1 times
        compute distance from each item to closest mean
        select an item that has large distance-squared as next initial mean
   end loop

score 1 · Accepted Answer · answered May 02 '19 at 00:09

K-Mean++ is an algorithm for choosing the initial values (or "seeds") for the k-means clustering algorithm. It was proposed in 2007 by David Arthur and Sergei Vassilvitskii, as an approximation algorithm for the NP-hard k-means problem—a way of avoiding the sometimes poor clusterings found by the standard k-means algorithm.

The exact algorithm is as follows:

Choose one center uniformly at random from among the data points.
For each data point x, compute D(x), the distance between x and the nearest center that has already been chosen.
Choose one new data point at random as a new center, using a weighted probability distribution where a point x is chosen with probability proportional to D(x)2.
Repeat Steps 2 and 3 until k centers have been chosen.
Now that the initial centers have been chosen, proceed using standard k-means clustering.

Besides this, you may find the following answers helpful too; 1, 2, 3 and 4

Reference: http://ilpubs.stanford.edu:8090/778/1/2006-13.pdf

score 0 · Answer 2 · answered May 02 '19 at 01:05

0

K-means++ does not select the element with the largest distance.

It uses a weighted random to prefer farther elements.

answered May 02 '19 at 01:05

Has QUIT--Anony-Mousse

39,639
7
61
96

Initialization of Kmeans++ clusters

2 Answers2