
How can I know for which clustering algorithms (with a parameter that represents the number of clusters) it makes sense to use the Gap statistic? I've read in the paper by Tibshirani, Walther & Hastie that

It is designed to be applicable to virtually any clustering method.

But then the authors proceed in the theoretical part to

For simplicity [...] focus on the widely used K-means clustering procedure.

My question is: for which procedures can it really be applied? What changes do I need to make (if any) when applying the Gap statistic to other procedures? Should I choose different distance measures for different procedures, as opposed to defaulting to the Euclidean distance used for K-means?
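For reference, these are the quantities from the paper as I read them. For a partition into clusters $C_1, \dots, C_k$ with sizes $n_r$ and pairwise dissimilarities $d_{ii'}$ (the paper notes that squared Euclidean distance is the most common choice), with $B$ reference data sets:

$$
D_r = \sum_{i, i' \in C_r} d_{ii'}, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} D_r, \qquad
\mathrm{Gap}_n(k) = E_n^{*}\{\log W_k\} - \log W_k
$$

$$
\hat{k} = \text{smallest } k \text{ such that } \mathrm{Gap}(k) \ge \mathrm{Gap}(k+1) - s_{k+1}, \qquad
s_k = \mathrm{sd}_k \sqrt{1 + 1/B}
$$

Here $E_n^{*}$ is the expectation under the reference distribution and $\mathrm{sd}_k$ is the standard deviation of $\log W_k$ over the $B$ reference data sets. The clustering procedure enters only through the partition, and the distance only through $d_{ii'}$, which is why I am asking about both.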

To provide a specific list of algorithms I am curious about:

  • k-modes & k-prototypes - Does it make sense to use the Gap statistic with a different distance measure? Specifically, a distance related to the cost functions used by these two algorithms?
  • Ward hierarchical clustering
  • Spectral clustering - Is there any way to make the Gap statistic useful for selecting the number of clusters in spectral clustering? I am not sure whether I should swap the Euclidean distance for some other measure (if so, which?), keep using the Euclidean distance, or whether there simply is no way to make the Gap statistic meaningful.
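To make the question concrete, here is a minimal sketch of how I understand the procedure, with the clustering routine and the dissimilarity passed in as arguments. All names (`sq_euclidean`, `within_dispersion`, `toy_kmeans`, `gap_statistic`) are my own and purely illustrative, `toy_kmeans` is only a stand-in for whichever procedure is being evaluated, and the reference data are drawn uniformly over the range of each feature, which is one of the reference choices described in the paper. The points in the usage lines are made up.

```python
import math
import random


def sq_euclidean(a, b):
    """Squared Euclidean distance, the usual choice of d in the paper."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def within_dispersion(points, labels, dist):
    """W_k: for each cluster, sum dist over all ordered pairs, divide by 2*n_r."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return sum(
        sum(dist(a, b) for a in members for b in members) / (2 * len(members))
        for members in clusters.values()
    )


def toy_kmeans(points, k, n_iter=20, seed=0):
    """Stand-in clustering procedure (plain Lloyd iterations); returns labels."""
    centers = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [
            min(range(k), key=lambda j: sq_euclidean(p, centers[j])) for p in points
        ]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster went empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def gap_statistic(points, cluster_fn, dist, k, n_refs=20, seed=0):
    """Return (Gap(k), s_k) using a uniform reference over the bounding box.

    Assumes numeric features; math.log fails if a dispersion is exactly 0.
    """
    rng = random.Random(seed)
    bounds = [(min(col), max(col)) for col in zip(*points)]
    log_w = math.log(within_dispersion(points, cluster_fn(points, k), dist))
    ref_logs = []
    for _ in range(n_refs):
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in points]
        ref_logs.append(math.log(within_dispersion(ref, cluster_fn(ref, k), dist)))
    mean_ref = sum(ref_logs) / n_refs
    sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
    return mean_ref - log_w, sd * math.sqrt(1 + 1 / n_refs)


# Two tight, well-separated groups of made-up points.
points = [(i * 0.1, j * 0.1) for i in range(3) for j in range(3)]
points += [(10 + x, 10 + y) for x, y in points]
for k in (1, 2, 3):
    gap, s_k = gap_statistic(points, toy_kmeans, sq_euclidean, k)
    print(k, round(gap, 3), round(s_k, 3))
```

Mechanically, nothing stops me from passing a Ward or spectral clustering routine as `cluster_fn`, or a simple-matching dissimilarity as `dist` for k-modes. The uniform reference draw, however, assumes numeric features and would need a replacement for categorical data. Whether the resulting number still means anything in those cases is exactly what I am unsure about.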

I am sure that after reading my question the first thought will be that it really depends on what I mean by the words "useful", "meaningful", "right" and "work". Putting this aside, I am looking for systematic ways to choose the number of clusters. I would like these ways not to be irrational, and I would like to avoid doing something that is widely considered a bad approach.

ira
  • Gap clustering criterion is suitable to validate cluster solutions of any cluster analysis. The index is akin to ANOVA-based ones such as Calinski-Harabasz (https://stats.stackexchange.com/a/358937/3277). Therefore, it is for a quantitative dataset. – ttnphns Aug 23 '20 at 10:12
  • @ttnphns And is it suitable even if I use the Euclidean distance as the measure of distance? Or should I use the distance that is used by the respective algorithm? I suppose that, at the very least for categorical variables, the Euclidean distance doesn't make much sense, and therefore at least for k-modes and k-prototypes I need to use a different measure of distance? – ira Aug 23 '20 at 11:42
  • Gap is OK for euclidean distance (if a program can compute it from the distance matrix at all - I don't remember now). Gap is not for categorical data. – ttnphns Aug 23 '20 at 14:54
