Use of bootstrap in clustering algorithms

Question

Are there clustering algorithms that take advantage of bootstrap?

For example can one combine bootstrap with a standard K-Means algorithm to scale K-Means.

I was thinking if the following at a high-level:

Use bootstrap to create sub-samples from the elements that need to be clustered.
Each of these sub-samples should be the size that K-Means scale
K-Means each sub-sample in parallel
Use the co-occurrence of elements in clusters to extract clusters from the larger population

Thanks.

score 7 · Accepted Answer · edited May 23 '17 at 12:39

If you're still interested in applying bootstrap approach to clustering, I would recommend you to read the following two excellent and IMHO relevant answers: one on applying bootstrapping to clustering (https://stats.stackexchange.com/a/11702/31372) and another on determining optimal number of clusters and more (https://stackoverflow.com/a/15376462/2872891). In addition to some fit measures, mentioned in the above-referenced answers, I'd also suggest to use AIC and/or BIC, as described in answers to this question, with more details here.

There is a significant amount of research literature on using bootstrapping in clustering (for example, see this and this). Despite being beyond my current knowledge level, interest and the scope of this answer, I would like to point to two specifically interesting resources. This paper on using bootstrapping for speeding up K-means clustering might be also of interest, but requires "manual" implementation. This blog post seems to me interesting and relevant to the topic as well.

As a side note, I would recommend you to take a look at an alternative (non-K-means) approach to clustering, called model-based clustering, as implemented in mclust R package. The approach, methods and software are described in the paper "mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation" by Chris Fraley, Adrian E. Raftery, T. Brendan Murphy and Luca Scrucca: http://www.stat.washington.edu/research/reports/2012/tr597.pdf.

[This question and answer](http://stats.stackexchange.com/q/43185/31372) might also be of interest. — Aleksandr Blekh, Dec 25 '14 at 12:15

score 1 · Answer 2 · answered Dec 25 '14 at 10:53

Usually, it will be good enough to run k-means on a single sample.

So yes, one can combine k-means with sampling, in two ways:

run k-means on the sample, and use the result.
run k-means on the sample, use the resulting centers, and continue on the full data set (which should usually converge within a small number of iterations now).

I'm not convinced that repeating the process with multiple samples is essential, or that performing a more complex sampling helps. It's a best practise to run it multiple times, and it may work well to use different samples then each time.

Use of bootstrap in clustering algorithms

2 Answers2

Linked