3

These days I'm using (and discovering) a lot of nice ways to apply k-means clustering. For example, clustering word embeddings (word2vec vectors) to find synonyms, or clustering document vectors (doc2vec) to classify topics.

But for these tasks I always need to specify the number of clusters (the 'k'). Is there some unsupervised algorithm that estimates a good number of clusters and, optionally, also performs the clustering itself?

denisb411
  • 153
  • 7
  • For soft clustering of text documents, there is Latent Dirichlet Allocation with a hierarchical Dirichlet process to identify K. It is quite advanced and very interesting (as far as I understand). For a more general k-means kind of clustering, some researchers at my uni came up with [adaptive weights clustering](https://bayesgroup.github.io/bmml_sem/2016/AWC_2016_Repino.pdf) – KenHBS Dec 20 '17 at 22:15
  • 1
    This question mixes two different things in one: (i) clustering methods that do not require you to specify K; (ii) ... methods that attempt to select the "best" K automatically before or during the clustering process. – ttnphns Dec 21 '17 at 08:41
  • 1
    Given that you say "similar to k-means", you probably are interested in what I've marked as (ii). TwoStep cluster analysis tries to determine the best number K during its 1st step based on a minimum spanning tree approach. The 2nd step is hierarchical clustering; however, one could, in principle, implement k-means as the 2nd step. – ttnphns Dec 21 '17 at 08:52
  • 2
    But any "automatic" - i.e. beforehand - methods to determine K will always lose to the old and good "afterwards" method - it is when you do clusterings (maybe on a subsample) with different K and then select the best K [from some point of quality](https://stats.stackexchange.com/q/195456/3277). – ttnphns Dec 21 '17 at 08:57
  • 2
    I'm surprised no one has mentioned expectation-maximization for Gaussian mixtures, for instance. You generally need to specify $k$, but if it's too high, the algorithm will probably reveal it. If you're just getting into the field, it might be difficult to understand, and it is more of a framework than a specific algorithm. However, it's a good thing to keep in mind if you want to gain a better understanding of clustering in general. – cangrejo Dec 21 '17 at 11:44
  • Thanks for the suggestion @broncoAbierto. I came to this area only about 7 months ago, so there's a lot to learn. I will check that out. – denisb411 Dec 21 '17 at 11:50

5 Answers

6

Yet another approach is to use the Mean Shift algorithm. You must specify a radius (bandwidth) around points to explore, but it determines the number of clusters itself.
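As a minimal sketch (using scikit-learn and synthetic blob data; the quantile value is only an example to tune):

```python
# Sketch: MeanShift needs a bandwidth (the "radius" mentioned above),
# but the number of clusters comes out of the algorithm itself.
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# estimate_bandwidth guesses a radius from the data; you can also set it by hand.
bandwidth = estimate_bandwidth(X, quantile=0.2)
labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

print("estimated number of clusters:", len(np.unique(labels)))
```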

G5W
  • 2,483
  • 1
  • 9
  • 21
  • I found that DBSCAN gives better results for my case, but unfortunately I have to reduce the dimensionality to 2 to get acceptable results. I'm doing this with t-SNE; would you advise another method? PCA was performing poorly... – denisb411 Dec 21 '17 at 15:28
  • t-SNE and PCA would be my first two choices. There is also multi-dimensional scaling (MDS), but it is closely related to PCA, so I would expect similar performance for PCA and MDS. But you have said little about your data or problem. Is feature selection practical? Might it be that some features are more important than others, so you could discard the less important ones before a t-SNE/PCA-type reduction? – G5W Dec 21 '17 at 16:11
  • No, I can't. Those are vectors created by Doc2Vec and I don't have information about which features are relevant. There are 200 features per document vector. I found that PCA (retaining 99%), then t-SNE down to 3 dimensions, and then DBSCAN gave good results. I still don't know whether it will stay good as the quantity of data increases, and it will increase a lot. – denisb411 Dec 21 '17 at 17:20
4

NASA's AutoClass employs a Bayesian approach to inferring not only the cluster centers but also the most likely number of clusters. A similar tool, NBCTK, uses more modern inference algorithms (e.g. variational Bayes).

Some of the nice features include:

  1. Support for mixed data types (e.g. discrete and real-valued features)
  2. Handles missing values

It works fairly well on relatively small datasets (at least on mine). Some basic understanding of how the algorithm works under the hood helps to use the tool to its full potential. Expect the processing time for even moderately sized datasets to be significant.

Note: the most likely number of clusters is found under the assumptions made by the model. One must not interpret the result as absolute truth.
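For a rough illustration of the same Bayesian idea with a more widely available library (this is not AutoClass itself), scikit-learn's `BayesianGaussianMixture` with a Dirichlet-process prior assigns near-zero weight to components it doesn't need, so the effective number of clusters is inferred up to a chosen upper bound:

```python
# Sketch of the Bayesian idea with scikit-learn (not AutoClass): a
# Dirichlet-process mixture prunes superfluous components by weight.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import BayesianGaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

bgm = BayesianGaussianMixture(
    n_components=10,  # generous upper bound on the number of clusters
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

# Components that keep a non-negligible weight are the "used" clusters.
print("effective number of clusters:", int(np.sum(bgm.weights_ > 0.01)))
```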

Vladislavs Dovgalecs
  • 2,315
  • 15
  • 18
3

If you mean clustering without specifying the number of clusters up front, you have a couple of approaches:

  1. estimate $k$ - use the elbow method or x-means (a minimal elbow-method sketch follows below). Also, bisecting k-means doesn't need $k$ specified.
  2. use a different clustering scheme altogether

If you choose 2, then you have further families of methods, notably hierarchical and density-based clustering.

Both hierarchical methods and density-based methods require some kind of distance parameter to be specified - the clusters are formed by grouping the points based on that distance.
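A minimal elbow-method sketch (assuming scikit-learn and synthetic data): run k-means for a range of $k$ and look for the point where the within-cluster sum of squares stops dropping sharply.

```python
# Elbow method sketch: inspect how the within-cluster sum of squares
# (inertia) decreases as k grows, and pick k at the "elbow".
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)  # plot these values and choose k at the elbow
```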

Jakub Bartczuk
  • 5,526
  • 1
  • 14
  • 36
  • Thank you for the suggestion. I found a Python implementation of x-means but it requires a 'k_min'; is this normal? Check this: https://github.com/mynameisfiber/pyxmeans/blob/master/pyxmeans/xmeans.py I liked the 'sklearn'-style implementation. – denisb411 Dec 21 '17 at 01:17
  • 2
    (+1) an example of a different clustering scheme that provides an estimate of $k$ is [spectral clustering](https://arxiv.org/pdf/0711.0189.pdf) with the eigengap. If the amount of noise is low relative to the clarity of the clusters then this can work well, but in my experience if there's a decent amount of noise (as there tends to be) then the spectrum looks pretty smooth and it's not clear what to do. Also there are usually similarity function / graph parameters and they can be just as hard to set as $k$ – jld Dec 21 '17 at 01:44
2

DBSCAN and Affinity Propagation are clustering algorithms that don't require you to specify k.

Alternatively, you can iterate through multiple values of k and see which one gives you the best results according to some clustering-quality measure (a sketch of this follows below).

A 3rd approach is to ask a domain expert in whichever domain you are working in to give you a rough estimate of how many clusters you should be expecting.
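A rough sketch of the second suggestion, with the silhouette score standing in for "some clustering-quality measure" (any other criterion could be substituted):

```python
# Sweep over candidate k and keep the value with the best silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 10):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)
```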

Skander H.
  • 10,602
  • 2
  • 33
  • 81
1

Density clustering methods (MeanShift, DBSCAN, ...) don't require you to specify the number of clusters; instead you specify something like the order of magnitude of the distance at which points are considered "connected". The parameter is, (very) roughly speaking, the expected radius of a cluster, and the algorithm then finds the number of clusters itself.

Density algorithms tend to fail in high dimensions because the space is almost empty everywhere (or, alternatively, the distances concentrate around a certain value in a way that is extremely difficult to compensate for). Density algorithms thus often require dimensionality reduction first, unlike k-means. With word2vec you usually start with dimension 300, which might be a problem (something to experiment with). If the density algorithms fail, you can reduce the dimension first, possibly drastically.
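A minimal sketch of that workflow, with synthetic blobs standing in for the 200-dimensional doc2vec vectors (the eps value is only a placeholder that needs tuning on real data):

```python
# Sketch: reduce the dimensionality first, then let DBSCAN find the
# number of clusters itself; points labeled -1 are treated as noise.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# stand-in for 200-dimensional document vectors
X, _ = make_blobs(n_samples=1000, n_features=200, centers=5, random_state=0)

X_reduced = PCA(n_components=10).fit_transform(X)  # drastic reduction

labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(X_reduced)  # eps needs tuning
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
```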

Benoit Sanchez
  • 7,377
  • 21
  • 43
  • OK, so the vectors have 200 dimensions each; to how many dimensions do you think it would be ideal to reduce them before applying the density methods? Also, I'm using t-SNE but just for plotting. Do you have an opinion on other, better dimensionality-reduction algorithms? – denisb411 Dec 21 '17 at 11:48
  • t-SNE is not recommended except for plotting (honestly, I never tried it for anything else). I'd recommend positive matrix factorization (PMF). What dimension: I don't know. Try 200 first anyway, then try 10, try 2... then adjust. The failure of density clustering most often results in all points collapsing into a single cluster, or one/few points per cluster. It's most often blatant: changing the distance parameter just takes you from one extreme to the other. – Benoit Sanchez Dec 21 '17 at 11:54