I was testing some clustering validity indexes on the Iris dataset and got something odd with scikit-learn: the silhouette index is better for 2 clusters than for 3 clusters (the real, or natural, number of partitions).
Python 3.6.9, scikit-learn 0.24.2:
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score
iris = datasets.load_iris()
nr_clusters = 2
km = KMeans(n_clusters=nr_clusters).fit(iris.data)
print(f'Silhouette Score(n={nr_clusters}): {silhouette_score(iris.data, km.labels_)}')
print(davies_bouldin_score(iris.data, km.labels_))    # Davies-Bouldin (lower is better)
print(calinski_harabasz_score(iris.data, km.labels_)) # Calinski-Harabasz (higher is better)
Result:
Silhouette Score(n=2): 0.681046169211746
0.40429283717304365
513.9245459802768
If I run k-means with the correct number of clusters, the silhouette index gets worse:
nr_clusters = 3
km = KMeans(n_clusters=nr_clusters).fit(iris.data)
print(f'Silhouette Score(n={nr_clusters}): {silhouette_score(iris.data, km.labels_)}')
print(davies_bouldin_score(iris.data, km.labels_))    # Davies-Bouldin (lower is better)
print(calinski_harabasz_score(iris.data, km.labels_)) # Calinski-Harabasz (higher is better)
Result:
Silhouette Score(n=3): 0.5528190123564091
0.6619715465007511
561.62775662962
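For reference, the same comparison can be wrapped in a loop over several values of k. This is just a sanity-check sketch; it fixes random_state and n_init (which the snippets above do not), so the exact values may differ slightly from those shown:

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

iris = datasets.load_iris()
scores = {}
for k in range(2, 7):
    # Fixed seed and explicit n_init so reruns give the same partition
    labels = KMeans(n_clusters=k, random_state=0, n_init=10).fit_predict(iris.data)
    scores[k] = silhouette_score(iris.data, labels)
    print(f'k={k}: silhouette={scores[k]:.3f}, '
          f'DB={davies_bouldin_score(iris.data, labels):.3f}, '
          f'CH={calinski_harabasz_score(iris.data, labels):.1f}')
```

On my runs, k=2 consistently wins on silhouette no matter the seed, so this does not look like an environment problem.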
There is probably something wrong with my environment, or I have missed some point. I ran an equivalent test in R and got the expected result: the k=3 clustering had a better silhouette than k=2.
My question is not about programming or the algorithm implementation. It is about the nature of the silhouette index and why it does not behave well in this textbook case.
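One way to dig into this is to look at the per-sample silhouette values rather than the global average. The sketch below (my own addition, using silhouette_samples and a fixed random_state) breaks the k=3 score down per cluster; the well-separated species gets a high mean while the two overlapping species drag the overall average down:

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

iris = datasets.load_iris()
labels = KMeans(n_clusters=3, random_state=0, n_init=10).fit_predict(iris.data)

# Silhouette value for each individual sample
sil = silhouette_samples(iris.data, labels)

# Average silhouette per cluster: one cluster is far better than the others
means = {c: sil[labels == c].mean() for c in np.unique(labels)}
for c, m in means.items():
    print(f'cluster {c}: mean silhouette = {m:.3f}')
```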