3

I have a dataset with roughly 20 dimensions and millions of points that I want to cluster.

The goal is to find a set of clusters which:

  • Are as distinct as possible from each other (minimum coupling between clusters), i.e. maximum average distance between centroids
  • Have instances that are as similar as possible (maximum cohesion) inside each cluster, i.e. minimum average distance between the points within a cluster (a rough way to quantify both criteria is sketched below).
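
For what it's worth, here is a rough sketch of how I currently quantify those two criteria; the toy `make_blobs` data and the use of scikit-learn's `KMeans` are just stand-ins for my real setup:

```python
# Rough sketch: quantify separation (mean pairwise centroid distance) and
# cohesion (mean distance of each point to its own centroid).
# The toy data below stands in for the real 20-dimensional dataset.
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, n_features=20, centers=10, random_state=0)

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)

# Coupling / separation: average distance between all pairs of centroids.
separation = pdist(km.cluster_centers_).mean()

# Cohesion: average distance of each point to its assigned centroid.
cohesion = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1).mean()

print(f"mean centroid separation: {separation:.3f}")
print(f"mean within-cluster distance: {cohesion:.3f}")
```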

Without considering the domain, is there a good metric to help determine the optimal $k$ I should choose?

Intuitively, I would pick $k=\sqrt{N}$ for a two-dimensional dataset and $k=\sqrt[M]{N}$ for a dataset with $M$ dimensions and $N$ data points, but I have a hunch that there are better methods.
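
For concreteness, evaluating that heuristic (the exact $N$ below is just an assumed stand-in for "millions of points"):

```python
# Evaluate the k = N^(1/M) heuristic; N = 2_000_000 is an assumed
# stand-in for "millions of points", M = 20 as in this question.
N, M = 2_000_000, 20
print(round(N ** 0.5))       # ~1414, the 2-D sqrt(N) rule
print(round(N ** (1 / M)))   # ~2, the M-th-root rule at 20 dimensions
```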

A related and complementary question is which distance metric to use. Due to the curse of dimensionality, I know that Euclidean distance becomes a poor choice as the number of dimensions increases.

Note that this question is different from Choosing optimal K for KNN (this one asks about clustering rather than k-NN classification).

arielf
  • 1,128
  • 11
  • 15
  • VTC as duplicate of [How to decide on the correct number of clusters?](http://stats.stackexchange.com/q/23472/1352), which I found through [this search](http://stats.stackexchange.com/search?q=%22number+of+clusters%22+is%3Aquestion). And please don't ask multiple different questions in one post. [This search](http://stats.stackexchange.com/search?q=distance+clustering+is%3Aquestion) may help you in your second question. – Stephan Kolassa Apr 16 '15 at 07:41

1 Answer

4

There is plenty of literature on choosing k, such as the silhouette plot. This question has also been asked a dozen times.
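
For illustration only, here is a minimal sketch of a silhouette-based search over candidate k using scikit-learn; the toy data, candidate range, and subsample size are placeholders (silhouette is quadratic in the number of points, so on millions of points you would score a subsample):

```python
# Sketch: pick k by silhouette score on a subsample, since silhouette is
# O(n^2) and scoring all of "millions of points" is impractical.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=50_000, n_features=20, centers=8, random_state=0)

best_k, best_score = None, -1.0
for k in range(2, 16):                      # candidate range is a placeholder
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels, sample_size=10_000, random_state=0)
    if score > best_score:
        best_k, best_score = k, score
    print(f"k={k:2d}  silhouette={score:.3f}")

print(f"best k by silhouette: {best_k}")
```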

The same goes for your second question, which has also been asked several times before:

Do not use k-means with distance functions other than sum of squared Euclidean distances; it may stop converging. k-means is not distance-based: it minimizes the very classic sum of squares, and the mean is the L2 estimator of centrality. If you want to use a different distance function, you need to choose the cluster centers differently. This has been done; see k-medoids, a.k.a. PAM.
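
For illustration, a minimal NumPy sketch of the alternating (Voronoi-iteration) variant of k-medoids with an arbitrary precomputed distance matrix; this is not the full PAM swap procedure, and in practice you would use an existing implementation:

```python
# Minimal k-medoids sketch (alternating / Voronoi-iteration variant) for an
# arbitrary precomputed distance matrix D. Not the full PAM swap procedure.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(D[:, medoids], axis=1)
        # Update step: within each cluster, pick the member that
        # minimizes the total distance to all other members.
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if members.size:
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Example with Manhattan (L1) distance instead of squared Euclidean.
X = np.random.default_rng(0).normal(size=(500, 20))
D = squareform(pdist(X, metric="cityblock"))
medoids, labels = k_medoids(D, k=5)
```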

Don't forget to spend a lot of time preprocessing your data and visualizing your results. People tend to neglect that, and get really bad results without noticing... again, see other questions here on k-means.
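
As a sketch of what that preprocessing and visual check could look like (the standardization and the 2-D PCA projection are just illustrative choices, not the only sensible ones):

```python
# Sketch: standardize features, cluster, and eyeball a 2-D PCA projection
# of the result. Scaling and PCA are illustrative choices only.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=20_000, n_features=20, centers=6, random_state=0)

X_scaled = StandardScaler().fit_transform(X)          # comparable feature scales
labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X_scaled)

X_2d = PCA(n_components=2).fit_transform(X_scaled)    # 2-D view for inspection
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=2)
plt.title("Clusters in the first two principal components")
plt.show()
```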

Has QUIT--Anony-Mousse
  • 39,639
  • 7
  • 61
  • 96
  • Thanks for the response. I found several related questions by searching for the kmeans tag. Some appear on the right as 'Related' but none of them answered these specific questions in a clear direct way. I now realize I should have asked about distance clustering rather than specifically about k-means. I'll look up Silhouette plots, thanks again. – arielf Apr 16 '15 at 06:58