3

Are there clustering algorithms that take advantage of bootstrap?

For example can one combine bootstrap with a standard K-Means algorithm to scale K-Means.

I was thinking if the following at a high-level:

  1. Use bootstrap to create sub-samples from the elements that need to be clustered.
  2. Each of these sub-samples should be the size that K-Means scale
  3. K-Means each sub-sample in parallel
  4. Use the co-occurrence of elements in clusters to extract clusters from the larger population

Thanks.

user1172468
  • 1,505
  • 5
  • 21
  • 36

2 Answers2

7

If you're still interested in applying bootstrap approach to clustering, I would recommend you to read the following two excellent and IMHO relevant answers: one on applying bootstrapping to clustering (https://stats.stackexchange.com/a/11702/31372) and another on determining optimal number of clusters and more (https://stackoverflow.com/a/15376462/2872891). In addition to some fit measures, mentioned in the above-referenced answers, I'd also suggest to use AIC and/or BIC, as described in answers to this question, with more details here.

There is a significant amount of research literature on using bootstrapping in clustering (for example, see this and this). Despite being beyond my current knowledge level, interest and the scope of this answer, I would like to point to two specifically interesting resources. This paper on using bootstrapping for speeding up K-means clustering might be also of interest, but requires "manual" implementation. This blog post seems to me interesting and relevant to the topic as well.

As a side note, I would recommend you to take a look at an alternative (non-K-means) approach to clustering, called model-based clustering, as implemented in mclust R package. The approach, methods and software are described in the paper "mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation" by Chris Fraley, Adrian E. Raftery, T. Brendan Murphy and Luca Scrucca: http://www.stat.washington.edu/research/reports/2012/tr597.pdf.

Aleksandr Blekh
  • 7,867
  • 2
  • 27
  • 93
1

Usually, it will be good enough to run k-means on a single sample.

So yes, one can combine k-means with sampling, in two ways:

  1. run k-means on the sample, and use the result.

  2. run k-means on the sample, use the resulting centers, and continue on the full data set (which should usually converge within a small number of iterations now).

I'm not convinced that repeating the process with multiple samples is essential, or that performing a more complex sampling helps. It's a best practise to run it multiple times, and it may work well to use different samples then each time.

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96