I have implemented k-means clustering on the iris dataset (the built-in dataset) in R. The code is given below:
X <- as.matrix(iris[-5])        # feature matrix: all columns except Species
K <- 3
prevCentroids <- matrix(0, K, ncol(X))
centroids <- X[sample(1:nrow(X), K), ]   # K random observations as initial centroids
dot <- numeric(K)               # squared distance from one point to each centroid
C <- numeric(nrow(X))           # cluster assignment for every observation
while (!isTRUE(all.equal(centroids, prevCentroids)))
{
  # Assignment step: label each point with its nearest centroid
  for (i in 1:nrow(X))
  {
    for (j in 1:nrow(centroids))
    {
      dot[j] <- (X[i, ] - centroids[j, ]) %*% (X[i, ] - centroids[j, ])
    }
    C[i] <- which.min(dot)
  }
  prevCentroids <- centroids
  # Update step: move each centroid to the mean of its assigned points;
  # drop = FALSE keeps the subset a matrix even when a cluster has only one point
  for (k in 1:K)
  {
    centroids[k, ] <- colMeans(X[which(C == k), , drop = FALSE])
  }
}
print(cbind(iris, C))
Sometimes this code gets about 85% of the clustering correct, but on other runs it only gets about 37% correct when I compare the assignments against the Species labels that already come with the built-in iris dataset.
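To clarify how I compare: roughly, I look at a cross-tabulation of my cluster labels against Species along these lines (only a sketch of the comparison; counting the majority species in each cluster as "correct" is just one way to score it):

conf <- table(cluster = C, species = iris$Species)   # rows: my clusters, columns: true species
print(conf)
# fraction of points whose cluster's majority species matches their own species
sum(apply(conf, 1, max)) / nrow(iris)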
Could anyone please tell me where I am going wrong?
Should I use set.seed() here so that the partition comes out the same way every time? Is that how one is supposed to implement k-means, with set.seed(), so that we always get the correct clustering?
If so, how does set.seed() fit into unsupervised learning? How would we know which seed generates the best set of random starting values?
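From what I have read, the usual approach seems to be to run k-means several times from different random starts and keep the run with the smallest total within-cluster sum of squares, rather than hunting for one "good" seed. A minimal sketch of that idea with the built-in kmeans() (the nstart value and the seed below are arbitrary; the seed only makes the run reproducible, it does not make it better):

set.seed(1)                                  # reproducibility only; any value would do
fit <- kmeans(X, centers = K, nstart = 25)   # 25 random starts; the best run is kept
fit$tot.withinss                             # objective value of the best run
table(fit$cluster, iris$Species)             # compare against the known species

Is this multiple-restart idea the right way to handle the randomness, rather than fixing a single seed?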