
I'm running k-means clustering on a dataset of 22,000 documents. Not knowing how many clusters to expect, I ran several values of k and tried to assess the validity of the clusters using the silhouette coefficient.

Here are the results:

10 clusters => s = 0.248
15 clusters => s = 0.278
20 clusters => s = 0.306
50 clusters => s = 0.387
200 clusters => s = 0.498
1000 clusters => s = 0.670

It seems ridiculous to me, as 1000 clusters for a dataset of 22,000 documents is far too many... and of course if I continue like that I will get s=1 for 22,000 clusters (proving that each document is not a duplicate of any other)...

How can I evaluate my results to determine the best number of clusters for the clustering?
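
For concreteness, a sweep like the one described above could be sketched as follows. This is only an illustration under assumptions: the post does not say which library or features were used, so scikit-learn, TF-IDF vectors and the cosine metric are my choices here, and `silhouette_by_k` and `toy_documents` are hypothetical names.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Tiny stand-in corpus (hypothetical); the real dataset has 22,000 documents.
toy_documents = [
    "the cat chased the mouse", "a cat and a kitten sleep",
    "kitten plays with mouse toy", "my cat purrs loudly",
    "stock markets fell sharply", "investors sold stock shares",
    "markets rally as shares rise", "bank shares and stock prices",
    "the team won the football match", "football players train daily",
    "the match ended in a draw", "team coach praised players",
]


def silhouette_by_k(documents, k_values, sample_size=None, seed=0):
    """Return {k: mean silhouette coefficient} for k-means on TF-IDF vectors."""
    # Assumption: plain TF-IDF with English stop words removed.
    X = TfidfVectorizer(stop_words="english").fit_transform(documents)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        # The silhouette needs all pairwise distances, so on a large corpus
        # pass sample_size to score a random subset of the documents instead.
        scores[k] = silhouette_score(
            X, labels, metric="cosine", sample_size=sample_size, random_state=seed
        )
    return scores


if __name__ == "__main__":
    for k, s in silhouette_by_k(toy_documents, [2, 3, 4]).items():
        print(f"{k} clusters => s = {s:.3f}")
```

On the full dataset, something like `sample_size=2000` would keep the scoring step tractable, at the cost of some variance in the reported value.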

Vincent Teyssier
  • You probably have too many unusual documents (outliers). That's why the results keep on getting "better". k-means on text never works well. It produces "something", but not much better than random. – Has QUIT--Anony-Mousse Apr 15 '16 at 20:55
  • Might be, but when I do a 20-cluster run and look at the top terms of each cluster, I clearly see some categories I could label. Could it be because I have too much noise? I mean, that I have not filtered enough meaningless terms through my stop word list? And if k-means doesn't work well, what would you suggest for a 400,000-document dataset where I have no idea how many clusters there are or what they would be? – Vincent Teyssier Apr 15 '16 at 21:05
  • Search for "reading tea leaves". Top term lists tend to look much better than the clusters are... Try labeling the cluster, but then check many documents from it, too (without updating your label because you erred). In particular, look at the least central documents. – Has QUIT--Anony-Mousse Apr 15 '16 at 21:16
  • Lol, ok ;) so you don't think noise is having an influence? Or maybe the minimum occurrence threshold should be higher (using the default of 2 for the moment). I'm just trying to figure out what tweaking would give a clearer delimitation between clusters – Vincent Teyssier Apr 15 '16 at 21:22
  • stopwords != noise. What I consider noise are unusual documents. With text, you usually have plenty of one-of-a-kind texts. – Has QUIT--Anony-Mousse Apr 16 '16 at 12:58
  • Ok, thanks for the definition. I tried enlarging my training set and the silhouette decreased, which is another clue that there are too many outliers. Thanks for helping anyway – Vincent Teyssier Apr 20 '16 at 09:48
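
The check suggested in the comments (looking at the least central documents of each cluster rather than only at the top terms) could be sketched roughly like this. It is a minimal illustration, assuming small dense vectors and Euclidean distance to the cluster mean; `least_central` is a hypothetical helper, not something from the thread. With scikit-learn, the distances returned by `KMeans.transform` could be ranked in the same way.

```python
import math
from collections import defaultdict


def least_central(vectors, labels, n=3):
    """Return {cluster label: indices of the n documents farthest from the cluster mean}."""
    # Group document indices by their assigned cluster.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    result = {}
    for label, idxs in members.items():
        dim = len(vectors[idxs[0]])
        # Centroid = component-wise mean of the cluster's vectors.
        centroid = [sum(vectors[i][d] for i in idxs) / len(idxs) for d in range(dim)]
        # Rank members by Euclidean distance to the centroid, farthest first.
        ranked = sorted(idxs, key=lambda i: math.dist(vectors[i], centroid), reverse=True)
        result[label] = ranked[:n]
    return result
```

Reading the documents at the returned indices, without revising the label chosen from the top terms, gives a rough sense of whether a cluster is coherent or mostly a collection of one-of-a-kind texts.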

0 Answers