
I use scikit-learn to load the IRIS and WINE datasets for evaluating an algorithm for K-means clustering. The K-means algorithm is a heuristic for solving the minimum-sum-of-squares clustering (MSSC) problem; that is, it does not guarantee an optimal solution to the MSSC. I was therefore wondering where I can find the optimal solutions of the IRIS and WINE instances of MSSC. Are you aware of any data sets for which the optimal clusters are available?
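
For concreteness, here is roughly my setup (a minimal sketch; k = 3 for both datasets, matching the number of classes, is my own choice):

```python
from sklearn.datasets import load_iris, load_wine
from sklearn.cluster import KMeans

for name, loader in [("IRIS", load_iris), ("WINE", load_wine)]:
    X = loader().data
    # K-means is a heuristic: the result depends on the initial centroids.
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    # inertia_ is the MSSC objective (within-cluster sum of squared distances).
    print(name, km.inertia_)
```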

If such optimal solutions are not available, I am wondering how different K-means algorithms are currently compared.

rasul
  • Is it that you want a provably optimal (in the SSwithin = min sense) solution of these datasets? – ttnphns Aug 27 '19 at 08:47
  • K-means iterations always monotonically decrease SSwithin and converge to an optimum. The problem is that this optimum is only guaranteed for the specific set of initial centroids used, not for every possible set of initial centroids. The key, therefore, is in coming across the best initial centroids. – ttnphns Aug 27 '19 at 08:59
  • Try various methods of initializing https://stats.stackexchange.com/q/317493/3277. If many of them lead to the same final centroids (i.e., the same solution), the solution is very likely (though not certainly) the global optimum; a minimal sketch of this multi-start check follows these comments. – ttnphns Aug 27 '19 at 09:06
  • @ttnphns Yes, I am looking for a global optimum. So, if the starting point is chosen properly, K-means will always converge to a global optimum? For example, if we happen to use an optimal solution as the initial point, then the algorithm stops immediately. Is that right? I was wondering if you have a reference for conditions on the initial solution that guarantee convergence to a global optimum? – rasul Aug 27 '19 at 11:57
  • The trivial case is when the initial centres = the final centroids of the globally optimal solution; then 0 iterations are needed. Apart from that case, nobody can tell for sure that the solution will be the global optimum. But with not-too-big datasets you can experiment with various initial seeds [and finally come](https://stats.stackexchange.com/a/51825/3277) to a solution _almost_ 100% sure to be the global optimum. – ttnphns Aug 27 '19 at 14:01
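
A minimal sketch of the multi-start check suggested in the comments above: run K-means from many random initializations and see how often the lowest objective value recurs (100 seeds and k = 3 on IRIS are choices for illustration):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

X = load_iris().data
# One K-means run per random seed; n_init=1 so each run uses a single start.
inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed)
    .fit(X).inertia_
    for seed in range(100)
]
best = min(inertias)
# If most runs reach (near) the best objective value, that value is very
# likely, though not provably, the global MSSC optimum.
share = np.mean(np.isclose(inertias, best, rtol=1e-6))
print(f"best SSwithin: {best:.4f}, reached by {share:.0%} of runs")
```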

1 Answer


The IRIS dataset has labels. These are the ground truth values.

You could run your algorithm on the features only and evaluate whether it defines the same groups as the actual labels.

If your algorithm defines three groups and these perfectly coincide with the three labels, you could say that your algorithm is working well.
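
A minimal sketch of that comparison, using scikit-learn's adjusted Rand index (which is invariant to how the cluster labels are numbered):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

data = load_iris()
# Cluster on the features only, then compare against the known labels.
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data.data)
# 1.0 means the clustering coincides perfectly with the ground-truth classes.
print(adjusted_rand_score(data.target, pred))
```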

Mark Verhagen
  • Expanding on that answer: if your clustering does not perfectly coincide with the correct clustering, there are similarity measures such as the Jaccard coefficient, the Fowlkes & Mallows index, or the Rand index that you could use to compare them (see the sketch below these comments). – nope Aug 27 '19 at 08:16
  • Looks to me this doesn't answer the question. The Iris classes need not be the optimal clusters in the SSwithin = min sense. – ttnphns Aug 27 '19 at 08:35
  • I see what you mean. I guess I was referring to the classical way in which clustering on those sets is generally evaluated (the second question). If the question is whether there is a configuration of centroids for which the MSSC objective is minimal (and it of course matters how distance is calculated and how many centroids are introduced), I wonder if anyone has ever grid-searched this. – Mark Verhagen Aug 27 '19 at 08:49
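
A minimal sketch of two of the measures mentioned in the first comment, using scikit-learn (rand_score requires scikit-learn >= 0.24; a k = 3 K-means clustering of IRIS is assumed for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score, rand_score

data = load_iris()
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(data.data)
# Both measures compare two labelings regardless of how labels are numbered.
print("Rand index:           ", rand_score(data.target, pred))
print("Fowlkes-Mallows index:", fowlkes_mallows_score(data.target, pred))
```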