
I have implemented k-means clustering on the iris dataset (a built-in dataset) in R. The code is given below:

X <- as.matrix(iris[-5])                # drop the Species column
K <- 3

prevCentroids <- matrix(0, K, ncol(X))
centroids <- X[sample(1:nrow(X), K), ]  # random initialisation: K rows of X

dot <- numeric(K)                       # squared distances to the K centroids
C <- numeric(nrow(X))                   # cluster assignment of each observation

while (!isTRUE(all.equal(centroids, prevCentroids)))
{
  # assignment step: label each point with its nearest centroid
  for (i in 1:nrow(X))
  {
    for (j in 1:K)
    {
      dot[j] <- (X[i, ] - centroids[j, ]) %*% (X[i, ] - centroids[j, ])
    }
    C[i] <- which.min(dot)
  }

  prevCentroids <- centroids

  # update step: move each centroid to the mean of its assigned points
  # (this simple version assumes no cluster ever becomes empty)
  for (k in 1:K)
  {
    centroids[k, ] <- colMeans(X[which(C == k), ])
  }
}

print(cbind(iris, C))

Sometimes this code gets about 85% of the clustering correct, but other times only about 37%, when I compare the result with the species labels in the built-in iris dataset.

Could anyone please tell me where I am going wrong?

Should I use set.seed() here so that the partitions come out the same every time? Is this how one should implement k-means, with set.seed(), so that we get the correct clustering every time?

If yes, then how do we choose a seed in unsupervised learning? How would we know which seed generates the best set of random initial centroids?

  • Read this https://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means/133694#133694 – Nikolas Rieble Jan 11 '18 at 13:05
  • K-Means is very sensitive to initialisation, i.e. the random cluster centers used at the beginning, so it is not surprising that it gives different clustering performance. Typically, K-means is run multiple times with different initialisations and then the best run is chosen based on some clustering metric. – kedarps Jan 11 '18 at 17:52

1 Answer


The two comments on the question are very insightful; I am just expanding on them to make sure this question has an answer.

An important concept for the OP to review is local minima and saddle points vs. the global minimum in an optimization context.

If the objective function is convex (for example, linear regression with squared loss, or logistic regression), then the fitted model is unique.

On the other hand, if the objective function is non-convex, such as the sum of within-cluster squared distances in K-means, the fitted model is not unique: it depends on the initialization of the parameters.
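
To see this concretely, here is a minimal sketch of mine (using the built-in stats::kmeans with a single random start per run; the seeds 1 to 5 are arbitrary):

X <- as.matrix(iris[-5])

# One random start per run: different seeds can land in different local minima,
# so the total within-cluster sum of squares varies from run to run.
wss <- sapply(1:5, function(s) {
  set.seed(s)
  kmeans(X, centers = 3, nstart = 1)$tot.withinss
})
print(wss)   # typically not all equal

If the objective were convex, every run would converge to the same value regardless of the seed.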

Therefore, it is perfectly normal to see different runs of K-means give different results. I also strongly recommend reading this:

How to understand the drawbacks of K-means

Also, take a look at the R kmeans documentation: the nstart parameter gives the number of random initializations, which is used to overcome this local minimum / saddle point problem.
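
For example (a minimal sketch of mine; nstart = 25 is an arbitrary but common choice):

set.seed(1)                        # only to make this example reproducible
X <- as.matrix(iris[-5])
fit <- kmeans(X, centers = 3, nstart = 25)   # keep the best of 25 random starts
fit$tot.withinss                   # objective value of the best run
table(fit$cluster, iris$Species)   # compare clusters with the known species labels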

Haitao Du