
I want to find the optimal number of clusters (k) for a clustering algorithm (agglomerative hierarchical, if that matters). I want to evaluate the quality of the clustering with some metric for each candidate k (k = 1, 2, 3, ...). Since the number of possible k's can be as large as the number of data points, it is not practical to compute the metric for every k. But if there existed a metric which, when plotted against k, is independently monotonic on either side of the optimal k (i.e., the k for which the metric's value is highest), then it would be possible to reduce the search space for k: instead of evaluating the metric for all k, we could identify a subset or neighbourhood in which the best k is guaranteed to lie.
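
For example, given such a unimodal metric, something like a ternary search would need only O(log n) evaluations instead of n. A rough sketch (the `evaluate_k` callable is hypothetical - it stands for fitting the clustering with a given k and computing the metric):

```python
def best_k_ternary(evaluate_k, k_min, k_max):
    """Return the k in [k_min, k_max] maximising evaluate_k(k), assuming the
    metric is strictly increasing before its optimum and strictly decreasing
    after it (the monotonicity property asked about above)."""
    lo, hi = k_min, k_max
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if evaluate_k(m1) < evaluate_k(m2):
            lo = m1 + 1  # the optimum cannot lie in [lo, m1]
        else:
            hi = m2 - 1  # the optimum cannot lie in [m2, hi]
    return max(range(lo, hi + 1), key=evaluate_k)
```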

If such a metric exists, the graph would look something like this:

[Figure: a metric plotted against k, increasing monotonically up to the optimal k and decreasing monotonically after it]

I tried the different metrics I could find by searching online: C-index, Dunn index, Davies-Bouldin index, Calinski-Harabasz index, silhouette width. But none of them guarantees independent monotonicity on either side of the optimal k for ALL datasets - some of them may behave like this on some data, but not on other data. For example:

[Figure: example plots of these indices against k, showing non-monotonic behaviour]

So, does such a metric exist? Or is it possible to construct one?

Kristada673
  • 1) No such indices exist, because the curve shape depends on the data. 2) No such index is needed, because it is normal to expect that in your dataset, say, k=4 may be the best, k=6 much worse, but k=7 better again. There can be more than one internally good clustering solution with different k. 3) Internal clustering criteria are heuristic devices. Pay attention to elbows as well as, or instead of, absolute maxima (or minima). – ttnphns Jun 28 '17 at 09:53
  • (cont.) 4) Typically, we want few clusters. Clustering criteria such as AIC or BIC explicitly penalize excessive k; they also tend to produce quite smooth curves - that might please you, perhaps (see the sketch after these comments). – ttnphns Jun 28 '17 at 09:54
  • @ttnphns I see. How about external evaluation criteria? Could they be constructed to guarantee independent monotonicity on either side of the best k? – Kristada673 Jun 28 '17 at 09:58
  • External validation is another approach in [cluster validation](https://stats.stackexchange.com/a/195481/3277); it is "supervised". So it is strange that you have now decided to move from one to the other. – ttnphns Jun 28 '17 at 10:02
  • Your very question is strange. It is silly, I'd say, to think there must always be one best k in a cluster structure, except when the clusters are very clear and far apart - which is very seldom the case in practice. – ttnphns Jun 28 '17 at 10:08
  • It is not so strange to ask, really. All I need is a measure that guarantees independent monotonicity on either side of the best k - I don't care whether it's intrinsic or external. The motivation is to avoid having to run the evaluation for all possible k, which would be great if it were possible. So I don't think it's a silly question. – Kristada673 Jun 28 '17 at 10:15
  • Also, I don't think external metrics have to be supervised (perhaps what I mean by "external" is not what you mean by it, I don't know). For example, say you're clustering kids, and the cluster evaluation metric is the sum of the differences in the clusters' average running speeds. Of course this metric doesn't guarantee monotonicity for all datasets (if I knew a metric that did, I wouldn't be asking this question), but it is possible that it would be monotonic for some data instances, isn't it? – Kristada673 Jun 28 '17 at 10:15
  • Regarding your next-to-last comment above: it is unusual to check all k from 2 to n-1 (n being the number of points). People usually check only k = 2 through 15 or 20, except in rare special cases. Regarding your last comment: please see my definition of external validation in my link. Your question is all right, generally; it is still a bit strange to me, because I could never imagine wanting or insisting on such monotonicity. – ttnphns Jun 28 '17 at 10:23
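
A minimal sketch of the AIC/BIC suggestion from the comments, assuming scikit-learn is available; since BIC is model-based, it is computed here with a Gaussian mixture rather than with agglomerative clustering, on made-up blob data purely for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Made-up data with a few well-separated blobs; substitute your own X.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# BIC penalises model complexity, which grows with k, so the curve over k
# tends to be fairly smooth; lower BIC is better.
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, round(gmm.bic(X), 1))
```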

1 Answer


This is impossible to guarantee for any useful measure, in particular when you don't have control over the clustering algorithm.

But consider the following scenario:

You have a data set with 10 clusters, but they come in pairs. The best solution is k=10, but the second-best clustering should be the one with k=5 (each pair merged into a single cluster). So a good quality measure will have a second local maximum at k=5 - it cannot be monotonic on both sides of k=10.
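
A quick way to see this kind of two-peak behaviour is a sketch along the following lines (assuming scikit-learn; the synthetic paired-cluster data and the choice of silhouette as the index are only for illustration - on data like this you would expect local maxima near both k=5 and k=10 rather than a single unimodal curve):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Five widely separated pair centres; each is split into two nearby
# sub-clusters, giving 10 "true" clusters that merge naturally into 5 pairs.
pair_centres = np.array([[0, 0], [20, 0], [0, 20], [20, 20], [10, 10]], dtype=float)
offsets = np.array([[-1.5, 0.0], [1.5, 0.0]])

X = np.vstack([
    centre + off + rng.normal(scale=0.4, size=(50, 2))
    for centre in pair_centres
    for off in offsets
])

# Silhouette width over a range of k with agglomerative clustering.
for k in range(2, 16):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```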

We can of course propose many nonsense measures that do satisfy it. For example, the function -|N/k - 10| has the properties you asked for - its optimum depends only on N, not on the data - but I'd rather not use it on real data...

Beware that there is usually no best k. The idea that there is an objective best k is only an illusion, because every user may have a different need. In reality you will find some interesting k, investigate it, and repeat. It is an exploratory task.

Has QUIT--Anony-Mousse