
I have a dataset of $n$ $p$-dimensional vectors (objects) that I want to cluster.

  1. One way to do this is to compute the ($n \times n$) correlation matrix $C$, then obtain a dissimilarity matrix, $D$, from $C$ such that each element $d_{ij}$ in $D$ is a function of the single element $c_{ij}$ in $C$ (e.g., $D=1-C$), and then cluster on $D$.

  2. Instead, I want to take $C$ and obtain $D$, such that each element $d_{ij}$ of $D$ is a function of the entire vectors $C_i$ and $C_j$ in $C$ (for example, $d_{ij}$ can be the Euclidean distance between $C_i$ and $C_j$).

Why do 2)?

  • Approach 1) computes the dissimilarity (distance) between two input vectors $i$ and $j$ solely as a function of the similarity between $i$ and $j$. In contrast, approach 2) computes the dissimilarity between $i$ and $j$ as a function of how similar $i$'s correlations with all other vectors are to $j$'s correlations with all other vectors; that is, it compares correlation profiles.
  • On the problems with which I am working, approach 2) seems to perform better in practice.

What I am wondering is whether there is any reason why approach 2) would not be valid.
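For concreteness, here is a minimal sketch of the two approaches (in Python with NumPy/SciPy; the random data, its dimensions, and the average-linkage choice are illustrative assumptions, not part of the question):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))  # n = 20 objects, p = 5 dimensions (illustrative)

C = np.corrcoef(X)                # n x n correlation matrix between objects (rows)

# Approach 1: element-wise conversion, d_ij = 1 - c_ij.
D1 = 1.0 - C

# Approach 2: d_ij = Euclidean distance between rows C_i and C_j,
# i.e., between the objects' correlation profiles.
D2 = np.sqrt(((C[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1))

# Either dissimilarity matrix can be handed to agglomerative clustering.
Z1 = linkage(squareform(D1, checks=False), method="average")
Z2 = linkage(squareform(D2, checks=False), method="average")
```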

  • In addition to @Anony-Mousse's answer, please read [this](http://stats.stackexchange.com/questions/25908/how-to-determine-the-number-of-clusters-when-using-correlation-as-the-distance/25921#25921) and [this](http://stats.stackexchange.com/questions/36152/converting-similarity-matrix-to-distance-matrix/36158#36158) answer, which show how and why Pearson _r_ can be geometrically correctly converted into Euclidean distance. You can cluster based on _r_, of course, but clustering on Euclidean _d_ is more flexible: more clustering techniques are applicable and more sane interpretations are possible. – ttnphns Feb 16 '13 at 22:28
  • Can you clarify `Instead, I want...obtain D, such that each element d_ij is a function of the entire vectors C_i and C_j`? Do you mean you want the $d_{ij}$ to incorporate _all $n-1$_ correlations of object _i_, and likewise of object _j_? If yes, how do you see it? An example? There must be some multivariate analysis done beforehand to fulfil your idea. – ttnphns Feb 16 '13 at 22:50
  • Thank you all for your comments. To clarify, here is a basic example. Let us have 4 objects with a correlation matrix C={C_1, C_2, C_3, C_4}, where C_1={1,0.3,0.2,-0.5}, C_2={0.3,1,-0.2,0.9}, C_3={0.2,-0.2,1,0.5}, C_4={-0.5,0.9,0.5,1}. With approach 1, d_12=(1-c_12)=1-0.3=0.7. With approach 2, d_12=||C_1-C_2||=1.76 (approx.). So approach 2 still uses correlations, but it computes Euclidean distances (or some other metric) on pairs of correlation vectors (see the sketch after these comments). – user20913 Feb 16 '13 at 23:40
  • @ttnphns, just to point out, this is for agglomerative clustering. – user20913 Feb 17 '13 at 01:46
  • So you take the corr. matrix as a data matrix and compute eucl. d between its columns (or rows, which is the same); that means you aim to cluster objects based on the similarity of their _correlational profiles_ rather than on the magnitude of pairwise correlations. Well, you have the right to do it, but this is a _totally different_ analysis, and so the words `seems to perform better in practice` become meaningless. Also, your d_12=||C_1-C_2||=1.76 partly mixes diagonal with off-diagonal elements: 1-0.3 and 0.3-1, which is questionable. To my mind, only off-diagonal elements should be compared. – ttnphns Feb 17 '13 at 07:54
  • Thank you, @ttnphns, great comments. Unfortunately, for two objects, the only way I can think of to include their own pairwise correlation as part of the correlation profiles is to include the diagonal elements. With respect to 'better in practice', to clarify: I know the correct clustering for a set of objects (based on a different, unrelated metric), and approach 2) seems to recapitulate the known relationships between the objects more closely. – user20913 Feb 17 '13 at 14:36
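To pin down the numbers in the comments above, here is a small sketch (again Python/NumPy, which no one in the thread specifies) reproducing user20913's example, plus the off-diagonal-only variant ttnphns suggests:

```python
import numpy as np

C = np.array([[ 1.0,  0.3,  0.2, -0.5],
              [ 0.3,  1.0, -0.2,  0.9],
              [ 0.2, -0.2,  1.0,  0.5],
              [-0.5,  0.9,  0.5,  1.0]])

d12_approach1 = 1.0 - C[0, 1]           # 0.7

# Approach 2 on full profiles; this is the version that mixes
# diagonal with off-diagonal entries (1 vs. 0.3 and 0.3 vs. 1).
d12_full = np.linalg.norm(C[0] - C[1])  # ~1.76

# Off-diagonal-only variant: drop positions 1 and 2 so neither
# object's self-correlation enters the comparison.
keep = [2, 3]
d12_offdiag = np.linalg.norm(C[0, keep] - C[1, keep])  # ~1.46
```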

1 Answer


Actually, most approaches that I know of use distances, not correlations.

Computing the full distance matrix requires O(n^2) memory and runtime, and you can often do better, but R and Matlab users in particular seem to like this approach a lot, probably because those environments have fast built-in routines for this particular task, whereas the O(n log n) approaches you would have to implement yourself.

Either way, AFAICT 2) is actually the common approach, and 1) is a work-around to be able to use correlations by transforming them, via 1-r or 1-r^2, into a dissimilarity matrix.

Logically, it makes no difference whether you compute the similarity and then convert it into a dissimilarity in two steps, or compute 1-r immediately before storing it in the matrix. However, R and Matlab may have fast routines for computing covariance and correlation matrices that are faster than computing 1-r yourself for every cell (read up on vectorized operations).
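To illustrate the vectorization point (a Python/NumPy sketch; the answer itself names no language):

```python
import numpy as np

X = np.random.default_rng(1).standard_normal((500, 20))

# Vectorized: one library call plus one array operation.
D = 1.0 - np.corrcoef(X)

# Equivalent cell-by-cell loop: same result, but far slower in an
# interpreted language because the work happens per cell.
n = X.shape[0]
D_loop = np.empty((n, n))
for i in range(n):
    for j in range(n):
        D_loop[i, j] = 1.0 - np.corrcoef(X[i], X[j])[0, 1]
```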

Whether 1) or 2) works better probably depends on whether Euclidean distance or Pearson correlation dissimilarity is more useful for your problem. There are 50+ other distances you could try as well.
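If you want to experiment, one option (a sketch assuming SciPy; the answer names no library) is `scipy.spatial.distance.pdist`, which exposes a couple dozen metrics behind a single interface, including a `correlation` metric that is exactly the 1-r of approach 1:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.default_rng(2).standard_normal((100, 10))

# 'correlation' computes 1 - r between rows directly (approach 1);
# swapping the metric string is all it takes to try alternatives.
D_corr = squareform(pdist(X, metric="correlation"))
D_eucl = squareform(pdist(X, metric="euclidean"))
D_canb = squareform(pdist(X, metric="canberra"))
```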

Anony-Mousse