
I see a couple of questions on stats.SE regarding finding the optimal number of clusters for a clustering problem.

However, I do not want the clustering algorithm to do that for me. Which algorithms can I use when I already know the number of clusters in my data set?

pnp
  • The general answer to your question: _any_. Any clustering method, whatever the algorithm, allows you to choose the number of clusters you want, either by specifying that number in advance (as in the k-means method) or by selecting it afterwards (as in hierarchical methods). However, the _problem_ to consider here is how the various methods deal with outliers / noise points. For example, k-means will force such points into your k clusters (thereby "spoiling" the clusters); hierarchical clustering may keep such points as tiny clusters of their own (so you end up with more than k clusters). – ttnphns Dec 27 '13 at 09:54
  • So, the choice of the "right" clustering method (which is indeed a very important choice) is not driven by the wish to obtain a _known_ number of clusters. It is driven by the nature of your data (numeric? binary?), by assumptions about the distribution/shape of the clusters (normal? globular? irregular?), and by their juxtaposition (demarcated? fuzzy?). – ttnphns Dec 27 '13 at 10:04
  • @ttnphns Actually, I have to disagree. Neither DBSCAN nor MeanShift, to name just two examples, will let you choose the number of clusters, either beforehand or afterwards. You would need to experimentally vary the parameters to get the desired number of clusters. – Has QUIT--Anony-Mousse Dec 28 '13 at 10:40
  • @Anony-Mousse, this does not contradict what I said. In this case, you actually choose the number of clusters _by_ varying the parameters. – ttnphns Dec 28 '13 at 11:29
  • But even then, these algorithms may be unable to yield exactly `k` clusters. In fact, even hierarchical clustering may not have a cut that yields exactly `k` clusters, because of ties. Consider the iris data set: due to its 0.1 measurement resolution, it contains plenty of ties, in particular with the Manhattan distance. – Has QUIT--Anony-Mousse Dec 28 '13 at 11:34

1 Answer


Most likely, you are not looking for cluster analysis in the first place.

You see, cluster analysis is about discovering structure in your data. What if there is no structure? What if the data is just uniform random noise?

I don't like calling $k$-means a clustering algorithm; I prefer calling it a vector quantization algorithm, because it will force your data into $k$ groups, no matter what. It doesn't actually try to discover "structure" in your data; it minimizes the mathematical quantity "within-cluster variance" under the constraint that the data be partitioned into $k$ parts. The result usually "looks like" a clustering, and the algorithm can therefore often be used for this purpose, but that is not what it does from a mathematical point of view. It doesn't discover structure; it optimizes SSQ.
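
A minimal sketch of this point (assuming Python with scikit-learn; any $k$-means implementation will do): run on pure uniform noise, $k$-means will still dutifully report $k$ "clusters".

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))      # uniform random noise: no structure at all

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))       # five non-empty "clusters", regardless
print(km.inertia_)                   # the SSQ (within-cluster variance) it minimized
```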

IMHO, cluster analysis starts when you have algorithms that A) can return "there are no clusters in this data" and B) can discover overlapping structures as well as outliers that belong to no structure at all.

On uniform data, or on a single normal distribution, a good "clustering" algorithm should return that there is only one big cluster, and at most drop some outliers. Anything else is, IMHO, a failure at discovering even this simplest structure of all.
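
A quick sketch of this behavior (again assuming scikit-learn; the `eps` and `min_samples` values are illustrative and would need tuning to your data's scale): a density-based method such as DBSCAN can report one cluster plus a few outliers on a single Gaussian blob, whereas $k$-means must report exactly $k$ clusters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))       # one normal distribution, no sub-clusters

labels = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())  # DBSCAN marks outliers with the label -1
print(n_clusters, "cluster(s),", n_noise, "outliers")
```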

When you know the number of clusters because you want to do vector quantization (e.g. reduce the number of colors to $k$, or represent the data using a codebook of $k$ words), then vector quantization algorithms may be what you are looking for. Recent advances in cluster analysis (in particular in the data mining community) are based on various different paradigms (not just $k$-means), and more often than not will return neither a hard partitioning nor a representative object like the cluster mean. In particular, clusters may nowadays overlap and have concave shapes.
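
Here is a sketch of the vector-quantization use case, color reduction (scikit-learn's bundled sample image is just a convenient stand-in; any RGB array works, and fitting on a pixel subsample is a common shortcut):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_sample_image

img = load_sample_image("china.jpg") / 255.0       # (H, W, 3) RGB in [0, 1]
pixels = img.reshape(-1, 3)

k = 16                                             # k is known *by design* here
rng = np.random.default_rng(0)
sample = pixels[rng.choice(len(pixels), 10_000, replace=False)]
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(sample)

codebook = km.cluster_centers_                     # the k representative colors
quantized = codebook[km.predict(pixels)].reshape(img.shape)
```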

If you have visually inspected your data (which is a good idea, if you can still visualize it), consider just using brushing (literally "painting" the clusters in the visualization!) to manually annotate your clusters instead of looking for a magic algorithm that does what you can easily select with your mouse.
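
A rough sketch of such brushing (matplotlib's `LassoSelector` is one possible tool for this, not something prescribed here): draw a lasso around points to paint them as a cluster by hand.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from matplotlib.path import Path
from matplotlib.widgets import LassoSelector

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                 # stand-in for your own 2D data
labels = np.zeros(len(X), dtype=int)          # 0 = not yet assigned

fig, ax = plt.subplots()
pts = ax.scatter(X[:, 0], X[:, 1], c=cm.tab10(labels % 10))

def on_select(verts):
    """Paint every point inside the drawn lasso as a new cluster."""
    inside = Path(verts).contains_points(X)
    labels[inside] = labels.max() + 1
    pts.set_facecolors(cm.tab10(labels % 10))
    fig.canvas.draw_idle()

selector = LassoSelector(ax, on_select)       # keep a reference so it stays active
plt.show()
```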

Has QUIT--Anony-Mousse
  • Your definition of what cluster analysis is, is rather your "IMHO". On the one hand, `discovering structure in data` is too broad, because "structure" is a much wider word than "clusters". On the other hand, your 3rd paragraph is unjustifiably strict and narrow. I don't object, though, because everybody's opinion/definition, including mine, would be biased. – ttnphns Dec 28 '13 at 11:43
  • It's definitely only one view of cluster analysis, which is what I'm trying to convey. He seems to be looking for different clustering methods, and many of these will not be designed around the "vector quantization" point of view; most of the recent clustering development is happening in the KDD community, which emphasizes the "knowledge discovery" aspect. To avoid confusion, I recommend using the more precise terms, e.g. vector quantization if your main interest is finding a finite set of representative values. – Has QUIT--Anony-Mousse Dec 28 '13 at 12:12
  • "_Recent advances in cluster anaysis_" ... you could name a few here, or give some links? – pnp Dec 30 '13 at 04:20
  • You may want to get the book [Cluster Analysis (TOC)](http://charuaggarwal.net/clusterbook.pdf) by C. C. Aggarwal, just out. I believe it covers many of the newer methods. – Has QUIT--Anony-Mousse Dec 30 '13 at 10:07