
I am clustering data using k-medoids. I computed the Davies–Bouldin index for $2$ to $n-1$ clusters, where $n = 100$ (a smaller test case). The index is minimal for 98 clusters, but the overall accuracy rate for 98 clusters is very small (smaller than 1). Here, accuracy rate means how accurately test data is matched to training data. What should I do in this situation? Also, for a larger dataset, computing the Davies–Bouldin value for every solution from $2$ to $n-1$ clusters is a large task. What should I do for a larger dataset?

Here is my plot of Davies–Bouldin index values for the cluster solutions [the X axis, labelled "index", is actually the number of clusters in a solution].

[Image: plot of Davies–Bouldin index value against number of clusters]

ttnphns
Diptopol Dam
  • Please post here the curve of DB values of 100- to 2-cluster solutions, so we can see. Also, [this post](http://stats.stackexchange.com/q/52838/3277) might be informative reading. – ttnphns Apr 13 '13 at 21:41
  • @ttnphns I added my DB value plot – Diptopol Dam Apr 13 '13 at 22:21
  • The index suggests 4 or 5 clusters (i.e. the 3rd or 4th point from the left). I recommend that you also check other indices, if you have time, e.g. Calinski-Harabasz (which is similar to Davies-Bouldin) or point-biserial correlation or C-Index or Silhouette index (which are not based on ANOVA ideology). – ttnphns Apr 13 '13 at 22:38
  • @ttnphns The DB value for 4 clusters is 3.661918 and for 98 clusters it is 0.020936. Can davis-bouldin differ from other methods? – Diptopol Dam Apr 13 '13 at 23:03
  • Mate, FYI, it is Davies–Bouldin, not Davis. – sashkello Apr 14 '13 at 01:50
  • @DiptopolDam, _did_ you attentively read the post linked in my 1st comment? An answer there states clearly, and with an example, that one should not blindly trust the min or max of a clustering index value. Bends found in the profile of values are much more important. Of course, you clearly have 4-5 clusters, and not 98, out of 100! – ttnphns Apr 14 '13 at 06:39
  • @ttnphns I understand clearly now. But for a large data set, do I have to compute the Davies–Bouldin index for every possible number of clusters, or can I stop once I find a good bend? – Diptopol Dam Apr 14 '13 at 07:14
  • No, of course not. Suppose you have 1000 objects. Will you really want to check whether a 900 (or 500 or 100) cluster solution is "good"? No. We usually wish to split the data into somewhere between 2 and 10 clusters. This means that it is enough to compute and plot values from 2 to, say, 20 clusters, to be able to see what's going on with the curve on the plot. – ttnphns Apr 14 '13 at 07:38
  • Can you please tell me which tool you used to make the graph of number of clusters vs Davies–Bouldin? – user1015347 Jul 18 '13 at 07:17
  • @user1015347 use R (http://www.r-project.org/) – Diptopol Dam Jul 18 '13 at 09:24
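
To illustrate the advice in the comments (compute the index only for about 2 to 20 clusters and look at the shape of the curve), here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration and not the asker's code: the synthetic four-blob data, the function names, the simple alternating k-medoids with farthest-first initialisation (not full PAM), and the use of medoids in place of centroids in the Davies–Bouldin formula. It also assumes distinct points and k no larger than n. The index computed is the average over clusters of the worst ratio (S_i + S_j) / d(i, j), where S is the mean distance of a cluster's members to its medoid and d is the distance between medoids.

```python
import math
import random


def assign(points, medoids):
    """Label each point with the position (0..k-1) of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: math.dist(p, points[medoids[c]]))
            for p in points]


def k_medoids(points, k, rng, max_iter=100):
    """Alternating k-medoids with farthest-first initialisation (not full PAM)."""
    medoids = [rng.randrange(len(points))]
    while len(medoids) < k:
        # Add the point whose nearest current medoid is farthest away.
        medoids.append(max(range(len(points)),
                           key=lambda i: min(math.dist(points[i], points[m])
                                             for m in medoids)))
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            # New medoid: the member with the smallest total distance to the others.
            new_medoids.append(min(members,
                                   key=lambda i: sum(math.dist(points[i], points[j])
                                                     for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


def davies_bouldin(points, medoids, labels):
    """Davies-Bouldin index using medoids as cluster representatives (assumption)."""
    k = len(medoids)
    scatter = []
    for c in range(k):
        members = [points[i] for i, lab in enumerate(labels) if lab == c]
        # Mean distance of the cluster's members to its medoid.
        scatter.append(sum(math.dist(p, points[medoids[c]]) for p in members)
                       / len(members))
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j])
                     / math.dist(points[medoids[i]], points[medoids[j]])
                     for j in range(k) if j != i)
    return total / k


def db_profile(points, ks, seed=0):
    """Davies-Bouldin value for each number of clusters in ks."""
    rng = random.Random(seed)
    return {k: davies_bouldin(points, *k_medoids(points, k, rng)) for k in ks}


def make_blobs(seed=0):
    """Synthetic data (illustrative only): 4 Gaussian blobs of 25 points, n = 100."""
    rng = random.Random(seed)
    centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
    return [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
            for cx, cy in centers for _ in range(25)]


if __name__ == "__main__":
    # Scan only k = 2..20, as suggested in the comments, and inspect the profile.
    for k, value in db_profile(make_blobs(), range(2, 21)).items():
        print(k, round(value, 3))
```

Note that a singleton cluster has zero scatter, so when almost every object is its own cluster the index is driven toward 0 regardless of whether the solution is useful. That is consistent with the very small value reported for 98 clusters out of 100, and it is one reason to read the bend in the profile over a modest range of k instead of taking the global minimum.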
