I was testing some clustering validity indices on the iris dataset and got something odd with scikit-learn. The silhouette index gives a better value for 2 clusters than for 3 clusters (the real, or natural, number of partitions).

Python 3.6.9 and scikit-learn 0.24.2

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score

iris = datasets.load_iris()
nr_clusters = 2
km = KMeans(n_clusters=nr_clusters).fit(iris.data)
print(f'Silhouette Score(n={nr_clusters}): {silhouette_score(iris.data, km.labels_)}')
print(davies_bouldin_score(iris.data, km.labels_))     # Davies-Bouldin index (lower is better)
print(calinski_harabasz_score(iris.data, km.labels_))  # Calinski-Harabasz index (higher is better)

Result:

Silhouette Score(n=2): 0.681046169211746
0.40429283717304365
513.9245459802768

If I run k-means with the correct number of clusters, it gives a worse silhouette value:

nr_clusters = 3
km = KMeans(n_clusters=nr_clusters).fit(iris.data)
print(f'Silhouette Score(n={nr_clusters}): {silhouette_score(iris.data, km.labels_)}')
print(davies_bouldin_score(iris.data, km.labels_))
print(calinski_harabasz_score(iris.data, km.labels_))

Result:

Silhouette Score(n=3): 0.5528190123564091
0.6619715465007511
561.62775662962
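
For completeness, the same test can be wrapped in a loop over several values of k (a sketch of the same comparison; exact values vary a little between runs, since no random seed is set). In my runs, k=2 always comes out on top:

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
for k in range(2, 7):
    km = KMeans(n_clusters=k).fit(iris.data)
    print(f'Silhouette Score(n={k}): {silhouette_score(iris.data, km.labels_)}')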

Probably there is something wrong with my environment, or I missed some point. I ran an equivalent test in R and got the expected result, that is, the k=3 clustering had a better silhouette than k=2.

My question is not about programming or the algorithm; it is about the nature of the silhouette index and why it doesn't behave as expected in such a classic case.

  • To compare clustering index results, you should have presented the clustering partition of the data, instead of just saying "I did k-means". – ttnphns Nov 16 '21 at 08:58
  • Hi ttnphns. The dataset I presented is the iris dataset, probably the most used dataset in the history of statistics. I thought presenting the partition was not necessary due to its "fame". – Josir Nov 16 '21 at 17:59
  • You did a clustering of it. Where is the output showing which cluster each data point belongs to? – ttnphns Nov 16 '21 at 18:26
  • What is the "correct" number of clusters? The iris dataset contains 2 clusters sooner than 3, and it is normal that some clustering validity criteria, if not the majority of them, will "vote" for the 2-cluster solution. – ttnphns Nov 16 '21 at 22:14
  • The correct number is 3. The output can be viewed in the vast literature about the iris dataset. This one is very didactic: https://www.geeksforgeeks.org/exploratory-data-analysis-on-iris-dataset/ – Josir Nov 17 '21 at 12:15
  • Oh, gosh, Josir. It seems to me you hardly understand what cluster analysis is and how internal validation (with its 100+ competing criteria) differs from external validation. For internal validation, start maybe from [here](https://stats.stackexchange.com/a/358937/3277). – ttnphns Nov 17 '21 at 13:12

1 Answer

There is nothing "wrong" with it.

First of all, the R and Python implementations of the algorithm may differ, hence they may give different results. Second, $k$-means is a randomized algorithm: it starts from randomly initialized clusters, so it is not fully deterministic, and if you run it several times you can get a different result each time.
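
For instance, the run-to-run variability is easy to see by fixing different seeds (a minimal sketch using scikit-learn's random_state parameter, which your code leaves unset):

from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()

# Different seeds give different initial centroids, hence possibly
# different partitions and different silhouette values.
for seed in (0, 1, 42):
    km = KMeans(n_clusters=3, random_state=seed).fit(iris.data)
    print(f'seed={seed}: {silhouette_score(iris.data, km.labels_)}')

Fixing random_state also makes a single run reproducible, which helps when comparing scores across different k.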

Moreover, $k$-means is a clustering algorithm and you are applying it to labeled data, but there is no reason to believe that the clustering algorithm will learn to classify. For example, say that you have health data of patients with different diseases (labels) together with some demographic data. The clustering algorithm could learn perfectly reasonable clustering solutions that focus on the demographics while scattering the medical conditions over all the clusters. So the fact that Iris data has three labels does not mean that the only way to cluster it is three classes.
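
As a side note on the nature of the index itself: for each point $i$, with $a(i)$ the mean distance to the other members of its own cluster and $b(i)$ the mean distance to the members of the nearest other cluster, the silhouette is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},$$

and silhouette_score reports the mean of $s(i)$ over all points. The index rewards compact, well-separated clusters, so on data where one group lies far from two heavily overlapping groups (as is often noted for iris), a 2-cluster partition can quite plausibly average a higher silhouette than the 3-label partition.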

  • Thanks Tim. I understand that for a hypothetical dataset the randomness argument applies. But the silhouette should give a better value for k=3 under any circumstances on a "simple" dataset like iris, as davies_bouldin_score and calinski_harabasz_score did. But if this is a fact, as you said, is it possible to minimize the random effect of k-means? – Josir Nov 15 '21 at 23:18
  • @Josir as stated above, no reason why it “should”. – Tim Nov 15 '21 at 23:26
  • I also tried the init='k-means++' and n_init=1000 parameters in order to minimize the randomness of the initial centroids, and the results were always the same. I imagine that at some point the result should match the real cluster configuration, given the nature of the dataset, and give a better silhouette index for k=3. – Josir Nov 15 '21 at 23:36
  • @Josir there's nothing "real" about the labels being the clusters. Say you have a dataset with 50 categorical columns; you could use any of the columns as labels. There would then be 50 different sets of "real" clusters. Would you expect a clustering algorithm to be able to learn each of them just by using a different random initialization? It's literally searching for a needle in a haystack. – Tim Nov 16 '21 at 06:43
  • Hi @Tim. I mean "real" because it represents real objects. If you try to cluster a dataset that represents just "apples" and "bananas" and the algorithm returns more than 2 clusters, the clustering quality is poor and it fails its purpose. Am I wrong? – Josir Nov 16 '21 at 18:03
  • A minor note about comparing R and sklearn with iris: there's a typo in the sklearn iris data that's never been fixed. I don't know if this is the cause of the discrepancy, but as a starting point, I would only compare two methods that are using the same data. – Sycorax Nov 16 '21 at 18:16
  • Thanks Sycorax. This could be the cause! I will also try downloading the original CSV dataset. – Josir Nov 16 '21 at 18:19
  • @Josir say it's data on humans: you could have clusters by gender, age group, height, job, and many more categories. I guess you can also group flowers into more categories than the (arbitrary) botanical classification. – Tim Nov 16 '21 at 20:07