I am researching cluster analysis for data that include both categorical and continuous variables, for which I have read that Gower's similarity coefficient is a good proximity measure. I would like to start with an average linkage algorithm, and I have found that some recommend looking for the 'elbow' in the sum of squared error (SSE) scree plot as a guideline for deciding how many clusters to retain. Would Gower's similarity coefficient (being non-metric and non-Euclidean) allow me to create an SSE scree plot, or does that not make sense statistically?
- SS of deviations ("error") from _what_? – ttnphns Jul 18 '13 at 14:42
- SSE being the squared distance between each member of a cluster and its cluster centroid. – Laura Jul 18 '13 at 14:54
- No, centroids call for Euclidean distance. They make little sense with the Gower coefficient. Search this site for "clustering criterions" and "number of clusters" for further info. – ttnphns Jul 18 '13 at 15:04
- Ah, thank you, that was exactly what I was looking for. – Laura Jul 18 '13 at 15:06
1 Answer
SSE is the measure optimized by k-means.
It doesn't make much sense for any algorithm other than k-means. And even there it suffers from the fact that increasing k will decrease SSE, so about all you can do is look for the point at which further increasing k stops yielding a substantial decrease in SSE; that is essentially the vague "elbow method".
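To make the point about SSE shrinking with k concrete, here is a minimal sketch in plain Python. It is not Lloyd's k-means: for one-dimensional data the k-means objective can be minimised exactly by dynamic programming, which avoids random initialisation. The data and the helper names `sse` and `optimal_sse_1d` are made up for illustration.

```python
def sse(values):
    """Sum of squared deviations of the values from their mean (the centroid)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def optimal_sse_1d(data, k):
    """Smallest total within-cluster SSE over all partitions of 1-D data into k clusters.

    In one dimension the optimal clusters are contiguous runs of the sorted
    data, so dynamic programming over split points finds the exact optimum.
    """
    xs = sorted(data)
    n = len(xs)
    inf = float("inf")
    # best[j][i]: minimal SSE when the first i sorted points form j clusters.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for i in range(j, n + 1):
            for s in range(j - 1, i):  # the last cluster is xs[s:i]
                candidate = best[j - 1][s] + sse(xs[s:i])
                if candidate < best[j][i]:
                    best[j][i] = candidate
    return best[k][n]


# Hypothetical data with three visible groups.
data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7, 9.0, 9.1, 8.9]
curve = [optimal_sse_1d(data, k) for k in range(1, len(data) + 1)]

for k, value in enumerate(curve, start=1):
    print(f"k={k}: SSE={value:.3f}")
```

On this toy data the curve falls steeply up to k = 3 and is nearly flat afterwards, and it reaches zero when every point is its own cluster. That is why the minimum of SSE itself says nothing about the right k; only the bend is informative, and on real data the bend is often much less clear.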
There exist other criteria, such as the Silhouette, the Davies-Bouldin index, BIC, and AIC, that can be used to get an "alternative view" of what is actually optimal.
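Of these, the silhouette only needs pairwise dissimilarities, not centroids, so it can be computed from a Gower dissimilarity (one minus the similarity) on mixed data. The following is a rough sketch in plain Python, assuming range-scaled numeric variables and simple matching for categorical ones; the records, the naive average-linkage routine, and all function names are hypothetical and only meant to illustrate the idea.

```python
def gower_distance(a, b, is_numeric, ranges):
    """Gower dissimilarity: mean of per-variable dissimilarities in [0, 1]."""
    total = 0.0
    for x, y, numeric, r in zip(a, b, is_numeric, ranges):
        if numeric:
            total += abs(x - y) / r if r > 0 else 0.0  # range-scaled difference
        else:
            total += 0.0 if x == y else 1.0  # simple matching
    return total / len(a)


def average_linkage(dist, k):
    """Naive agglomerative clustering with average linkage, stopped at k clusters."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                pairs = len(clusters[p]) * len(clusters[q])
                d = sum(dist[i][j] for i in clusters[p] for j in clusters[q]) / pairs
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] += clusters[q]
        del clusters[q]
    labels = [0] * len(dist)
    for c, members in enumerate(clusters):
        for i in members:
            labels[i] = c
    return labels


def mean_silhouette(dist, labels):
    """Average silhouette width from a dissimilarity matrix (needs >= 2 clusters)."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    total = 0.0
    for i, lab in enumerate(labels):
        own = groups[lab]
        if len(own) == 1:
            continue  # singletons get silhouette 0 by convention
        a = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for other, members in groups.items() if other != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Hypothetical mixed records: (age, income in thousands, occupation).
records = [
    (25, 30.0, "student"), (27, 32.0, "student"), (24, 28.0, "student"),
    (45, 80.0, "manager"), (47, 85.0, "manager"), (50, 90.0, "manager"),
    (70, 40.0, "retired"), (72, 38.0, "retired"), (68, 42.0, "retired"),
]
is_numeric = [True, True, False]
ranges = [max(r[v] for r in records) - min(r[v] for r in records) if num else 0
          for v, num in enumerate(is_numeric)]
dist = [[gower_distance(a, b, is_numeric, ranges) for b in records] for a in records]

scores = {k: mean_silhouette(dist, average_linkage(dist, k))
          for k in range(2, len(records))}
best_k = max(scores, key=scores.get)
labels3 = average_linkage(dist, 3)
print(best_k, labels3)
```

The average silhouette width is then compared across k and the largest value taken as a candidate. For real work a tested implementation is preferable to this cubic-time loop; the sketch is only meant to show that nothing here requires a centroid.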
But in the end, that is just a mathematical heuristic. It may not work for real data.

Nick Cox

Has QUIT--Anony-Mousse