Distance measure methods of R function dist() evaluation

Question

I want to compute the distance matrix for the columns of a 1000 x 230 matrix using the dist() function in R. However, I am uncertain about which method to use.

I know the differences between the methods and how the algorithm works, but I would like to hear which one you would prefer when working with gene expression data sets. The values are z-scores derived from normalized gene expression data. That is, the data are real-valued, roughly from -10 to 10, and the sign matters: it indicates whether the gene is upregulated or downregulated.

To make the question more specific: which of the methods supported by dist() is best suited to understanding the differences and similarities between the columns of my matrix?

You may find a CV user who is knowledgeable in statistics and genomics here, but you'd probably be better off asking people who specialize in genomics. Perhaps the omics literature has some papers dedicated to this topic. I'm not saying this is off-topic, just trying to point you in the most helpful direction. — Sycorax, Sep 11 '15 at 16:14
Thanks. Sure, I have some literature on the subject, yet my goal is to briefly discuss which method would be best among those dist() offers, that is, among "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski". – Kwnwps, Sep 11 '15 at 16:33

score 2 · Accepted Answer · edited Apr 13 '17 at 12:44


Well, let's work through these one by one.

Minkowski norms with $p<1$ are not true distance metrics, so those are out provided you wish to use a true distance metric. This leaves $L_p$ norms for $p\ge1.$
Euclidean ($L_2$) tends to behave poorly in high dimensions, where distances concentrate and near and far points become hard to tell apart.
As $p$ gets smaller, the $L_p$ norm suffers less from the curse of dimensionality, so Manhattan ($L_1$) is a popular choice. That is, it tends not to be dominated by the dimension in which the difference between the two points is largest.
By the same reasoning, the $L_\infty$ norm ("maximum") is excluded, since it depends only on the single largest coordinate difference.
Canberra is intended for nonnegative values, e.g. counts; your data may be negative, so it's excluded.
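Binary treats every nonzero value as "on" and every zero as "off", so it is only meaningful for binary data; it would discard the sign and magnitude of your z-scores, so it is excluded as well.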
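A minimal sketch of how the remaining candidates could be computed in R. The matrix `x` is simulated stand-in data (an assumption, not the asker's data), and because `dist()` measures distances between rows, the matrix is transposed to compare columns.

```r
set.seed(1)
# Hypothetical stand-in for the 1000 x 230 z-score matrix (simulated, not real data)
x <- matrix(rnorm(1000 * 230), nrow = 1000, ncol = 230)

# dist() computes distances between rows, so transpose to compare the 230 columns
d_manhattan <- dist(t(x), method = "manhattan")
d_euclidean <- dist(t(x), method = "euclidean")
d_maximum   <- dist(t(x), method = "maximum")

# Minkowski with p = 1 is the same quantity as Manhattan
d_mink1 <- dist(t(x), method = "minkowski", p = 1)

# Each result covers all column pairs; as a full matrix it is 230 x 230
dim(as.matrix(d_manhattan))
```

For any pair of columns the three norms are ordered as maximum <= Euclidean <= Manhattan, so the absolute values are not comparable across methods; only the relative ordering of pairs within one method is informative.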

answered Sep 11 '15 at 16:41 by Sycorax (edited by Community)

+1. But what did you mean by `As p gets smaller [and is still >=1], it is less terrible`? Are you referring to the curse of dimensionality or some other property? I just did not understand. A distance itself is not terrible unless it is a night road through the woods. – ttnphns Sep 11 '15 at 20:54
Canberra distance is not only for counts. It is a great distance to cluster ranked data, such as, for example, sportsmen by their place taken (gold=1, silver=2, bronze=3, etc.). – ttnphns Sep 11 '15 at 20:56
Thanks for the answers. Thus, Manhattan should be the optimal choice as I understand. Yet still I don't exactly get, if my goal is to examine the differences between the rows, how Manhattan can be so much better than Euclidean? – Kwnwps Sep 11 '15 at 21:20
Any time you're faced with the problem of distance metrics in high dimensional space, you're not going to have a great time: all of your points are rapidly racing away from each other as you add dimensions. Notions of nearness in high dimensions defy our $\mathbb{R}^3$ intuitions. Euclidean distance squares the differences, but Manhattan just takes the absolute value; the result is that Euclidean distances are more strongly influenced by large componentwise differences. – Sycorax Sep 11 '15 at 21:24
Nice. I will go with the Manhattan choice. I also have the impression that it is the common choice for this kind of data, pretty much because of what you just explained. – Kwnwps Sep 11 '15 at 21:38
Why do you say that Canberra is intended for non-negative counts? The definition in Wikipedia says it's defined for any two n-dimensional vectors. – James Oct 14 '15 at 15:50
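The point made in the comments about squaring versus absolute values can be checked with a small, made-up pair of vectors. This is only an illustrative sketch; the vectors `a` and `b` are hypothetical.

```r
# Two hypothetical profiles: they differ by 1 in 99 coordinates and by 10 in one
a <- rep(0, 100)
b <- c(rep(1, 99), 10)

diffs <- abs(a - b)

# Share of the Manhattan distance due to the single largest coordinate: 10 / 109
share_manhattan <- max(diffs) / sum(diffs)

# Share of the squared Euclidean distance due to that coordinate: 100 / 199
share_euclidean <- max(diffs)^2 / sum(diffs^2)

c(manhattan = share_manhattan, euclidean = share_euclidean)
```

Here one outlying coordinate accounts for roughly 9% of the Manhattan distance but about half of the squared Euclidean distance.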

score 0 · Answer 2 · answered Sep 11 '15 at 16:43

You can forget about Canberra if your data takes negative values.

But the distance to use depends on the purpose!

Are you doing prediction from your data? Then you don't need expert advice: carefully cross-validating over the distances will tell you which one gives your model the highest performance.

Are you trying to visualize clusters in the data? Then, once you have obtained the distance matrix, plot the first axes of a multidimensional scaling analysis (R has a lot of packages for this).
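As a rough sketch of the visualization route, classical multidimensional scaling is available in base R through `cmdscale()`. The data below are simulated placeholders, and the choice of Manhattan distance and of two dimensions is an assumption made for illustration.

```r
set.seed(1)
# Simulated stand-in data: 1000 genes (rows) x 230 samples (columns)
x <- matrix(rnorm(1000 * 230), nrow = 1000, ncol = 230)

# Distances between columns (dist() works on rows, hence the transpose)
d <- dist(t(x), method = "manhattan")

# Classical multidimensional scaling into two dimensions
coords <- cmdscale(d, k = 2)

# One point per column of x; nearby points have similar profiles
plot(coords, xlab = "MDS axis 1", ylab = "MDS axis 2",
     main = "Columns embedded by classical MDS")
```

Swapping the `method` argument and redrawing the plot is a quick way to see how much the apparent cluster structure depends on the distance chosen.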

Are you trying to describe the data? Then you need to know the differences between every distance, and bear in mind that the conclusions you obtain will highly depend on the data. Euclidean metric may overweight large differences...

Thanks, I am trying to visualise clusters and the distance matrix in order to compare it with another matrix with the same column names but different values for the rows. — Kwnwps, Sep 11 '15 at 21:15

score 0 · Answer 3 · answered Sep 11 '15 at 17:13


I take it you work with fold changes, or log fold changes, hence the negative numbers. You could switch to positive ratios instead and use Canberra. I like it because it measures relative distance, as opposed to the absolute distance measured by Euclidean and similar metrics.
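A small sketch of what "relative" means here, using made-up positive vectors `u` and `v`: multiplying both by the same constant leaves the Canberra distance unchanged, while the Euclidean distance scales with the values.

```r
# Hypothetical positive-valued profiles
u <- c(1, 2, 3)
v <- c(2, 4, 6)

# For positive values each Canberra term is |u_i - v_i| / (u_i + v_i),
# i.e. a difference relative to the size of the two values
d_canberra        <- dist(rbind(u, v), method = "canberra")
d_canberra_scaled <- dist(rbind(10 * u, 10 * v), method = "canberra")

# Euclidean distance grows with the overall scale of the values
d_euclid        <- dist(rbind(u, v), method = "euclidean")
d_euclid_scaled <- dist(rbind(10 * u, 10 * v), method = "euclidean")

c(canberra = as.numeric(d_canberra), canberra_x10 = as.numeric(d_canberra_scaled),
  euclidean = as.numeric(d_euclid), euclidean_x10 = as.numeric(d_euclid_scaled))
```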

answered Sep 11 '15 at 17:13 by James

Actually no, the values are z-scores derived from normalised gene expression values. – Kwnwps Sep 11 '15 at 21:11
Then you have to think about the lack of "linearity" in the z-scores. That is, in practical terms, the difference between z = 1.5 and 1 is not the same as the difference between 2.9 and 3.4 when you eventually decide which genes are differentially expressed. Correspondingly, why not convert the z-scores to p-values and use Canberra? – James Sep 14 '15 at 15:57
Can you explain the lack of linearity more or provide a reference? Moreover, p-values of what pairwise test? – Kwnwps Oct 05 '15 at 11:25
Do you use z-values to compute p-values in the end? – James Oct 14 '15 at 15:44
I'm not sure I understand from which test you are proposing to take the p-values from. – Kwnwps Oct 15 '15 at 00:27
I wrongly assumed that you perform a gene differential expression test that generates z-values that are used for computing p-values and calling the differentially expressed genes. – James Oct 19 '15 at 14:18
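One possible reading of the "linearity" remark in this exchange, sketched under the assumption (not stated in the thread) that z-scores would be mapped to two-sided p-values under a standard normal: equal gaps in z correspond to very different gaps in p.

```r
# The z-values mentioned in the comment above
z <- c(1, 1.5, 2.9, 3.4)

# Assumption: two-sided p-values under a standard normal reference distribution
p <- 2 * pnorm(-abs(z))

# Both pairs are 0.5 apart on the z scale, but not on the p scale
gap_low  <- p[1] - p[2]   # z from 1 to 1.5
gap_high <- p[3] - p[4]   # z from 2.9 to 3.4

c(gap_low = gap_low, gap_high = gap_high)
```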
