How to determine optimal number of clusters?

Question

For a multilabel dataset, I would like to find the number of clusters present in it. The example below gives more detail about the problem:

Label_A: feature values
Label_B: feature values
Label_A, Label_C: feature values 
Label_C: feature values ... etc

Say we have $n$ data records. The label field may hold a single label or multiple labels (as in record 3).

I would like to determine the number of clusters in the data. Assuming that the number of clusters equals the number of labels results in poor accuracy, because a single label may span multiple clusters. In that case, if we can find more clusters and assign two or more clusters to the same label, we can increase the accuracy.

Hence, how do you find the number of clusters present in multilabel data?

see this related question on stopping criteria for cluster analysis: http://stats.stackexchange.com/questions/2597/what-stop-criteria-for-agglomerative-hierarchical-clustering-are-used-in-practice — Jeromy Anglim, Apr 28 '11 at 15:08
I've answered a similar Q with half a dozen methods (using `R`) over here: stackoverflow.com/a/15376462/1036500 — Ben, May 13 '13 at 04:54

score 3 · Accepted Answer · edited Apr 13 '17 at 12:44


You can convert the labels into binary features, one per label, each indicating whether that label is present for the record. After that you can use various clustering algorithms and their corresponding methods to find out the number of clusters.
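A minimal sketch of that conversion, in plain Python. The feature values below are made up for illustration (the question does not give any), and `labels_to_indicators` is a hypothetical helper name, not part of any library.

```python
def labels_to_indicators(label_sets):
    """Turn each record's set of labels into a 0/1 indicator vector."""
    # Fixed, sorted column order so every record uses the same layout.
    vocabulary = sorted({label for labels in label_sets for label in labels})
    rows = [[1 if label in labels else 0 for label in vocabulary]
            for labels in label_sets]
    return vocabulary, rows


# Toy records mirroring the question: (labels, feature values).
# The numeric feature values are invented for this example.
records = [
    ({"Label_A"}, [0.1, 0.2]),
    ({"Label_B"}, [0.9, 0.8]),
    ({"Label_A", "Label_C"}, [0.2, 0.7]),
    ({"Label_C"}, [0.3, 0.9]),
]

vocabulary, indicators = labels_to_indicators([labels for labels, _ in records])

# Append the indicator columns to the original features, giving one
# purely numeric row per record that a clustering algorithm can consume.
augmented = [features + row for (_, features), row in zip(records, indicators)]

for row in augmented:
    print(row)
```

Whether the indicator columns should be rescaled relative to the original features depends on the distance measure used by the clustering algorithm; the sketch leaves them as raw 0/1 values.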

EDIT: I understood that your difficulty was handling the multiple labels and I suggested a solution for that. Your question did not mention that you wanted to use the k-means algorithm. The number of k-means clusters question has been answered here: "How to define number of clusters in K-means clustering?". For hierarchical clustering the answer is here: "Where to cut a dendrogram?". But there are many other clustering methods available: "Choosing a clustering method".

answered Apr 28 '11 at 11:40 by GaBorgulya · edited Apr 13 '17 at 12:44 by Community

Could you please elaborate on your answer? I couldn't get your point about **use various clustering algorithms and their corresponding methods to find out the number of clusters**. If I choose to run k-means on the new dataset, I have to provide **k** as an input. My question is what value of **k** is best for the original dataset. – Learner Apr 28 '11 at 15:01
@Learner I've edited my post to reply to your comment. – GaBorgulya Apr 28 '11 at 19:13

score 0 · Answer 2 · answered Jun 06 '19 at 20:39

Many methods have been developed for this important question, but the most common and basic one is the elbow method. It works by tracking how a measure of within-cluster variation (for example, the within-cluster sum of squares) changes as you increase the number of clusters, and picking the point after which adding more clusters yields only small improvements. Other approaches are discussed here: https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
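As a rough illustration of the elbow method, here is a self-contained sketch in plain Python. Everything in it is an assumption made for the example: the synthetic three-blob data, the deterministic farthest-first seeding (chosen only so the run is reproducible), and the "largest second difference" rule used to locate the bend, which is one crude heuristic among several. In practice the elbow is usually judged by eye from a plot, and a library k-means would replace the hand-written one.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from the centers
    chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans_wcss(points, k, max_iter=100):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centers[j]))
            clusters[nearest].append(p)
        # Recompute each center as the mean of its members (keep it if empty).
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if updated == centers:
            break
        centers = updated
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


def elbow_by_second_difference(ks, wcss):
    """Crude elbow heuristic: the k at which the WCSS curve bends most
    sharply, measured by the largest second difference."""
    bends = {ks[i]: (wcss[i - 1] - wcss[i]) - (wcss[i] - wcss[i + 1])
             for i in range(1, len(ks) - 1)}
    return max(bends, key=bends.get)


# Synthetic data: three well-separated 2-D blobs of 30 points each.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in blob_centers for _ in range(30)]

ks = list(range(1, 7))
wcss = [kmeans_wcss(points, k) for k in ks]
elbow_k = elbow_by_second_difference(ks, wcss)

for k, w in zip(ks, wcss):
    print(k, round(w, 1))
print("elbow at k =", elbow_k)
```

Because the blobs here are far apart relative to their spread, the within-cluster sum of squares falls steeply up to k = 3 and then flattens, so the bend lands on the number of blobs. On real data the curve is often much smoother and the elbow may be ambiguous, which is why the other methods mentioned above exist.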
