
I am trying to cluster a distance matrix that contains numeric data, but I am not sure how to choose the number of clusters (the value of k) for the clara function in R. After running it with an arbitrary number of clusters, I ran the silhouette function on the result, and its summary gives me this:

Cluster sizes and average silhouette widths:

           7            3            4            5            7            4 
 0.222273330 -0.001592881  0.117937463  0.121326365  0.137911639  0.161932689 
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-0.10410  0.08961  0.12500  0.14140  0.19840  0.30580 

This is the result for k = 6. If I change k to, say, 5 or 4, I again get a silhouette width for each cluster as well as the mean value. How do I decide on the number of clusters? Should I plot the mean silhouette width against k? And how would this be done for a large dataset with around a million observations?

gung - Reinstate Monica
user2542275

1 Answer


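You can do either of these two things:

  1. Use the fviz_nbclust() function from the factoextra package, which plots the average silhouette width against the number of clusters, up to k.max:
        library(cluster)     # clara()
        library(factoextra)  # fviz_nbclust(); also attaches ggplot2 for theme_classic()
        fviz_nbclust(data, clara, method = "silhouette",
                     k.max = yourMaxValue) + theme_classic()
  2. Construct the graph yourself from the silhouette width information stored in the clara object:
        # clara.res is the object returned by clara().
        # silinfo$avg.width holds the overall average silhouette width.
        clara.res$silinfo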
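
The second option could be sketched roughly as follows. This is only an illustration under stated assumptions: the toy data, the range of k, and the names x, k_values, avg_sil and best_k are all made up for the example, and samples = 20 is an arbitrary choice. It runs clara() once per candidate k, reads the average silhouette width from silinfo, and plots it against k.

```r
library(cluster)  # provides clara()

set.seed(42)  # fixes the hypothetical toy data below
# Assumed example data: three well-separated 2-D Gaussian blobs.
centers <- rbind(c(0, 0), c(10, 10), c(0, 10))
x <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(200, centers[i, 1]), rnorm(200, centers[i, 2]))
}))

k_values <- 2:8  # candidate numbers of clusters (arbitrary range)

# For each k, take the average silhouette width that clara() reports
# for its best sample.
avg_sil <- sapply(k_values, function(k) {
  clara(x, k, samples = 20)$silinfo$avg.width
})

# Candidate k with the largest average silhouette width.
best_k <- k_values[which.max(avg_sil)]

plot(k_values, avg_sil, type = "b",
     xlab = "Number of clusters k",
     ylab = "Average silhouette width")
```

Because clara() clusters subsamples of the data, and silinfo refers to the best of those samples rather than to all observations, the same loop should stay feasible on a dataset with around a million rows; the sampsize argument of clara() controls how many observations each sample contains.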

Hope this helps.

kjetil b halvorsen