3

I have a dataset with roughly 20 dimensions and millions of points that I want to cluster.

The goal is to find a set of clusters which:

  • Are as distinct as possible from each other (minimum coupling between clusters), i.e. maximum average distance between centroids
  • Have instances that are as similar as possible (maximum cohesion) inside each cluster, i.e. minimum average distance between the points within a cluster (a rough way to quantify both criteria is sketched below).
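
For what it's worth, here is a rough sketch of how I currently quantify those two criteria; the toy `make_blobs` data and the use of scikit-learn's `KMeans` are just stand-ins for my real setup:

```python
# Rough sketch: quantify separation (mean pairwise centroid distance) and
# cohesion (mean distance of each point to its own centroid).
# The toy data below stands in for the real 20-dimensional dataset.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, n_features=20, centers=10, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Coupling / separation: average distance between all pairs of centroids.
separation = pdist(km.cluster_centers_).mean()

# Cohesion: average distance of each point to its assigned centroid.
cohesion = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean()

print(f"mean centroid separation: {separation:.3f}")
print(f"mean within-cluster distance: {cohesion:.3f}")
```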

Without considering the domain, is there a good metric to help determine the optimal $k$ I should choose?

Intuitively, I would pick $k=\sqrt{N}$ for a two-dimensional dataset and $k=\sqrt[M]{N}$ for a dataset with $M$ dimensions and $N$ data points, but I have a hunch that there are better methods.
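
For concreteness, evaluating that heuristic (the exact $N$ below is just an assumed stand-in for "millions of points"):

```python
# Evaluate the k = N^(1/M) heuristic; N = 2_000_000 is an assumed
# stand-in for "millions of points", M = 20 as in this question.
N, M = 2_000_000, 20
print(round(N ** 0.5))       # ~1414, the 2-D sqrt(N) rule
print(round(N ** (1 / M)))   # ~2, the M-th-root rule at 20 dimensions
```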

A related and complementary question is which distance metric to use. Due to the curse of dimensionality, I know that Euclidean distance becomes a poor choice as the number of dimensions increases.

Note that this question is different from Choosing optimal K for KNN (this one asks about clustering rather than k-NN classification).

arielf
  • 1,128
  • 11
  • 15
  • VTC as duplicate of [How to decide on the correct number of clusters?](http://stats.stackexchange.com/q/23472/1352), which I found through [this search](http://stats.stackexchange.com/search?q=%22number+of+clusters%22+is%3Aquestion). And please don't ask multiple different questions in one post. [This search](http://stats.stackexchange.com/search?q=distance+clustering+is%3Aquestion) may help you in your second question. – Stephan Kolassa Apr 16 '15 at 07:41

1 Answer

4

There is plenty of literature on choosing k, such as the silhouette plot. This question has also been asked a dozen times.
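
For illustration only, here is a minimal sketch of a silhouette-based search over candidate k using scikit-learn; the toy data, candidate range, and subsample size are placeholders (silhouette is quadratic in the number of points, so on millions of points you would score a subsample):

```python
# Sketch: pick k by silhouette score on a subsample, since silhouette is
# O(n^2) and scoring all of "millions of points" is impractical.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=50_000, n_features=20, centers=8, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 16):                      # candidate range is a placeholder
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, sample_size=10_000, random_state=0)
    if score > best_score:
        best_k, best_score = k, score
    print(f"k={k:2d}  silhouette={score:.3f}")

print(f"best k by silhouette: {best_k}")
```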

The same goes for your second question, which has also been asked several times before:

Do not use k-means with distance functions other than sum of squared Euclidean distances; it may stop converging. k-means is not distance-based: it minimizes the very classic sum of squares, and the mean is the L2 estimator of centrality. If you want to use a different distance function, you need to choose the cluster centers differently. This has been done; see k-medoids, a.k.a. PAM.
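
For illustration, a minimal NumPy sketch of the alternating (Voronoi-iteration) variant of k-medoids with an arbitrary precomputed distance matrix; this is not the full PAM swap procedure, and in practice you would use an existing implementation:

```python
# Minimal k-medoids sketch (alternating / Voronoi-iteration variant) for an
# arbitrary precomputed distance matrix D. Not the full PAM swap procedure.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, pick the member that
        # minimizes the total distance to all other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Example with Manhattan (L1) distance instead of squared Euclidean.
X = np.random.default_rng(0).normal(size=(500, 20))
D = squareform(pdist(X, metric="cityblock"))
medoids, labels = k_medoids(D, k=5)
```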

Don't forget to spend a lot of time preprocessing your data and visualizing your results. People tend to neglect that, and get really bad results without noticing... again, see other questions here on k-means.
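
As a sketch of what that preprocessing and visual check could look like (the standardization and the 2-D PCA projection are just illustrative choices, not the only sensible ones):

```python
# Sketch: standardize features, cluster, and eyeball a 2-D PCA projection
# of the result. Scaling and PCA are illustrative choices only.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=20_000, n_features=20, centers=6, random_state=0)

X_scaled = StandardScaler().fit_transform(X)          # comparable feature scales
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_scaled)

X_2d = PCA(n_components=2).fit_transform(X_scaled)    # 2-D view for inspection
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=2)
plt.title("Clusters in the first two principal components")
plt.show()
```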

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
  • Thanks for the response. I found several related questions by searching for the kmeans tag. Some appear on the right as 'Related' but none of them answered these specific questions in a clear direct way. I now realize I should have asked about distance clustering rather than specifically about k-means. I'll look up Silhouette plots, thanks again. – arielf Apr 16 '15 at 06:58