How to choose the right distance matrix for clustering?

Question

I am attempting simple Ward-type clustering. However, the R package offers several choices for the distance matrix. How am I supposed to determine the right distance method?

Are there any generally acceptable criteria for specific sets of problems?

Please search `Ward clustering` on this site; there have been a number of answers already. Among them my own are http://stats.stackexchange.com/a/53417/3277, http://stats.stackexchange.com/a/13889/3277 — ttnphns, Jul 27 '14 at 19:38
@ttnphns: thank you. I have noted that a Euclidean distance matrix is recommended for Ward. But when am I allowed to use any of the others, such as "manhattan", "canberra", "binary", "minkowski", "correlation", "uncentered" or "abscor"? BTW: I am working with pvclust — RndmSymbl, Jul 27 '14 at 19:48
You may use any distance. However, only (squared) Euclidean is fully correct with Ward. — ttnphns, Jul 27 '14 at 19:52
Are there possibly any references/sources to study as to why that is the case and possibly covering advantages/disadvantages of other methods? — RndmSymbl, Jul 27 '14 at 19:55
It's simple: Ward's method (and the centroid and so-called "median" methods) work by computing geometric centroids in Euclidean space. They do it in a way that requires squared Euclidean distances. — ttnphns, Jul 27 '14 at 20:07

score 3 · Accepted Answer · answered Jul 28 '14 at 10:35

Some algorithms have design limitations.

k-means and Ward are both designed for squared Euclidean distance (= sum of variances).
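To make the link to variances concrete, here is a minimal sketch in plain Python. It assumes the usual formulation of Ward's criterion: the cost of merging two clusters is the increase in the within-cluster sum of squared Euclidean distances to the centroid. The data points and helper names are made up for illustration.

```python
def centroid(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]

def sq_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def within_ss(points):
    # Sum of squared Euclidean distances from each point to the centroid.
    c = centroid(points)
    return sum(sq_euclidean(p, c) for p in points)

def ward_cost(a, b):
    # Merge cost computed from cluster sizes and centroids only.
    na, nb = len(a), len(b)
    return na * nb / (na + nb) * sq_euclidean(centroid(a), centroid(b))

# Two made-up clusters in the plane.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 4.0), (7.0, 4.0), (6.0, 7.0)]

# Increase in within-cluster sum of squares caused by merging a and b.
increase = within_ss(a + b) - within_ss(a) - within_ss(b)

print(increase, ward_cost(a, b))  # the two values agree up to rounding
```

The agreement between the two numbers depends on the distance being squared Euclidean. With another dissimilarity the centroid shortcut no longer measures a variance increase, which is why other distances are not fully correct with Ward.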

Others require the triangle inequality for correctness.

Still others (single-link) do not even need your distance to be non-negative... they can work with similarities or dissimilarities on any scale, as long as they know whether to prefer low or prefer high values.
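The single-link claim can be checked with a small sketch. The naive implementation below is written for illustration only (it is not how any particular package does it), and the item names and positions are made up. It clusters the same items twice: once with ordinary distances, and once with a monotone transform of them that produces negative values.

```python
import math
from itertools import combinations

def single_link_merges(items, dissim):
    """Naive single-link agglomeration; returns the clusters formed, in order."""
    clusters = [frozenset([i]) for i in items]
    merges = []
    while len(clusters) > 1:
        # Dissimilarity between two clusters = smallest pairwise value.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(dissim(x, y) for x in pair[0] for y in pair[1]),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append(a | b)
    return merges

# Made-up items on a line; all pairwise distances are distinct (no ties).
position = {"a": 0.0, "b": 1.0, "c": 3.0, "d": 7.0, "e": 15.0}
items = list(position)

def dist(x, y):
    return abs(position[x] - position[y])

def shifted(x, y):
    # Monotone increasing transform; some values are negative.
    return math.log(dist(x, y)) - 2.0

print(single_link_merges(items, dist) == single_link_merges(items, shifted))
```

Only the ordering of the values matters here, so the merge sequence is unchanged. For a similarity, where high values mean "close", the same sketch would use `max` in place of `min`.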

However, don't just choose the distance by the algorithm. Instead, choose the algorithm by the distance, and the distance must match your task.

On any numerical data set you can compute squared Euclidean, and run k-means. But the results may be completely useless. So first, study your data set, in particular how to quantify similarity. Without a working similarity measure, any clustering algorithm will only work by chance.
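As a small illustration of how the distance drives the outcome, the sketch below finds the nearest neighbor of one record under Euclidean distance, before and after z-scoring the columns. The records are hypothetical, and standardization is only one possible choice here, not a recommendation; whether it is appropriate depends on the task.

```python
import math
from statistics import mean, pstdev

# Hypothetical records: (height in metres, yearly income in dollars).
people = {
    "p1": (1.60, 30000.0),
    "p2": (1.95, 30900.0),
    "p3": (1.61, 31000.0),
}

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def standardize(records):
    # Z-score each column using the population standard deviation.
    columns = list(zip(*records.values()))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return {
        name: tuple((v - m) / s for v, m, s in zip(row, mus, sds))
        for name, row in records.items()
    }

def nearest(name, records):
    others = (other for other in records if other != name)
    return min(others, key=lambda other: euclidean(records[name], records[other]))

raw_neighbor = nearest("p1", people)
scaled_neighbor = nearest("p1", standardize(people))
print(raw_neighbor, scaled_neighbor)
```

On the raw values the income column dominates and the height difference is effectively ignored, so the two runs pick different neighbors. Neither answer is right in itself; which one is useful depends on what "similar" should mean for the problem.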
