Robert Tibshirani et al. proposed the Gap Statistic to estimate the number of clusters in a dataset. It involves calculating two quantities: $D_r$, the sum of the pairwise distances $d$ (under some distance metric; squared Euclidean is common) over all points in cluster $C_r$, $r \in \{1,\dots,k\}$, computed for each cluster; and $W_k$, the pooled within-cluster dispersion over all clusters for the fit using $k$ clusters:
$$D_r = \sum_{i,i' \in C_r} d_{i,i'}$$
$$\hat W_k = \sum_{r=1}^k \frac{1}{2n_r} D_r $$
Where $n_r$ is the number of points in cluster $C_r$.
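For concreteness, here is a minimal sketch of computing the observed $\hat W_k$ from data and cluster labels, assuming squared Euclidean distance and that NumPy/SciPy are available (the helper name `pooled_within_dispersion` is my own, not from the paper):

```python
import numpy as np
from scipy.spatial.distance import pdist

def pooled_within_dispersion(X, labels):
    """Observed W_k-hat = sum_r D_r / (2 * n_r), with d = squared Euclidean."""
    labels = np.asarray(labels)
    W = 0.0
    for r in np.unique(labels):
        pts = X[labels == r]
        n_r = len(pts)
        if n_r > 1:
            # pdist sums each unordered pair once; D_r in the paper sums over
            # ordered pairs (i, i'), so double it.
            D_r = 2.0 * pdist(pts, metric="sqeuclidean").sum()
            W += D_r / (2.0 * n_r)
    return W
```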
The gap statistic is:
$$\textrm{Gap}_n(k) = E_n[\log(W_k)] - \log(\hat W_k)$$
Where the expectation is taken with respect to samples of size $n=\sum_{r=1}^{k} n_r$ drawn from some null reference distribution. Hastie, Tibshirani, and Friedman use a uniform distribution over the rectangle containing the data [see The Elements of Statistical Learning, 2nd ed., p. 519].
Note:
I added a hat to the observed $W_k$ to differentiate it from the theoretical value $E_n[\log(W_k)]$. You won't see this in the linked paper, as it is implicitly understood...but it can be confusing to someone new to this. Keep in mind that $E_n[\log(W_k)]$ is the expected value of $\log(W_k)$ when it is computed from $n$ points drawn from the null reference distribution; it has nothing to do with your actual observed sample, which is what $\hat W_k$ captures.
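As a concrete illustration of "drawn from the null reference distribution": one reference dataset can be sampled uniformly over the axis-aligned bounding rectangle of the observed data. A rough sketch (the helper name is my own):

```python
import numpy as np

def sample_reference(X, rng=None):
    """Draw len(X) points uniformly over the bounding rectangle of X."""
    rng = np.random.default_rng(rng)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return rng.uniform(lo, hi, size=X.shape)
```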
Even with this simple reference distribution, you'll likely need to simulate $W_k$ under it to estimate $E_n[\log(W_k)]$. Your goal is to calculate the gap statistic for various values of $k$ (the number of clusters) and see which value of $k$ maximizes the gap statistic; a sketch of the whole procedure follows below.
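Putting the pieces together, here is a rough end-to-end sketch, not the authors' reference implementation. It assumes squared Euclidean distance, in which case $\hat W_k$ equals the total within-cluster sum of squares that scikit-learn's `KMeans` reports as `inertia_`, and it estimates $E_n[\log(W_k)]$ by averaging over `B` uniform reference draws:

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, B=20, seed=0):
    """Gap(k) for k = 1..k_max; pick the k with the largest gap."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = np.empty(k_max)
    for k in range(1, k_max + 1):
        # log(W_k-hat) on the observed data (inertia_ = within-cluster SS).
        log_W_obs = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                           .fit(X).inertia_)
        # Monte Carlo estimate of E_n[log(W_k)] under the uniform reference.
        log_W_ref = np.empty(B)
        for b in range(B):
            X_ref = rng.uniform(lo, hi, size=X.shape)
            log_W_ref[b] = np.log(KMeans(n_clusters=k, n_init=10, random_state=seed)
                                  .fit(X_ref).inertia_)
        gaps[k - 1] = log_W_ref.mean() - log_W_obs
    return gaps

# e.g. k_hat = 1 + np.argmax(gap_statistic(X))
```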