1

I am looking for methods/metrics to compare data matrices that originate from the same dataset projected onto two different feature spaces.

For background: these are DNA sequencing data projected onto two different gene catalogues. The catalogues likely overlap to a significant degree, but the exact correspondence between them is unknown, due to the lack of a single established standard in the field. There are about a hundred samples and about a thousand features in each matrix (the number of features is not the same).

One approach that I have tried is clustering the samples and visually examining the dendrograms. This could be taken further by using one of the available metrics for comparing dendrograms. I am looking for alternative methods of quantitative comparison.

Roger Vadim
  • 1) Could you say more specifically what kind of comparison you're interested in? What kind of output should it produce, and/or how do you want to use it? 2) Do you have both projections for each data point? Or are some points projected in one way and other points projected in another? 3) What kind of features do the projections produce (e.g. real values, categorical, etc.)? Does the feature space have any special structure, distance metrics, etc.? – user20160 Mar 19 '21 at 15:27
  • @user20160 These are great questions! In fact, I probably need some help answering some of them, since once the question is correctly formulated, the answer is usually obvious. I realize that the question might be a bit too open-ended for SE... – Roger Vadim Mar 22 '21 at 10:20

2 Answers

1

Here Samy Bengio explains CCA, which is what you might try first. It can tell you how similar two matrices are when they live in different spaces but share one dimension (here, the samples).

CCA (canonical correlation analysis) takes two sets of variables measured in the same experiment and finds what the two sets have in common. So it is general enough, I would say.

R has the standard function cancor, and several packages, including CCA and vegan, provide further implementations.
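
For readers not working in R, here is a minimal NumPy sketch of the same idea. It is an illustration under stated assumptions, not the procedure from the linked talk: the function name `canonical_correlations` and the parameter `n_components` are made up for this example, and because there are more features than samples, each matrix is first reduced to its leading principal components (a choice of this sketch, since plain CCA degenerates in that setting). The data are synthetic.

```python
import numpy as np


def canonical_correlations(X, Y, n_components=10):
    """Canonical correlations between two sample-by-feature matrices.

    X and Y must have the same rows (samples) but may have different
    numbers of columns (features). Assumption of this sketch: each matrix
    is first reduced to its top `n_components` principal components.
    """
    def pc_basis(M, k):
        M = M - M.mean(axis=0)                      # center each feature
        U, _, _ = np.linalg.svd(M, full_matrices=False)
        return U[:, :k]                             # orthonormal basis of the top-k PC scores

    Qx = pc_basis(np.asarray(X, dtype=float), n_components)
    Qy = pc_basis(np.asarray(Y, dtype=float), n_components)
    # The singular values of Qx^T Qy are the cosines of the principal angles
    # between the two subspaces, i.e. the canonical correlations.
    rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)                   # guard against round-off


# Synthetic demo (made-up data): two noisy views of the same 3 latent factors.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 3))
X = Z @ rng.normal(size=(3, 1000)) + 0.1 * rng.normal(size=(100, 1000))
Y = Z @ rng.normal(size=(3, 800)) + 0.1 * rng.normal(size=(100, 800))
rho = canonical_correlations(X, Y, n_components=5)
print(rho)  # values are sorted in decreasing order
```

Canonical correlations close to 1 suggest directions that both projections share. The result depends on `n_components`, so it is worth trying a few values.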

Good Luck
1

The solution that I adopted in practice was to compare the correlation matrices for the two datasets (calculated using the Pearson, Spearman, Kendall, or any other correlation coefficient of choice).

It is possible to formulate this problem rigorously as testing the hypothesis that the two correlation matrices are identical (see here and here). But it is also possible to use any of the available distance metrics for comparing two matrices, of which I found the following one particularly useful: $$ d(R_1, R_2) = 1 - \frac{\operatorname{tr}(R_1\cdot R_2)}{\lVert R_1\rVert\cdot \lVert R_2\rVert} $$ It is a generalization of the cosine similarity, and it has the advantage of varying between $0$ and $1$, which is more intuitive than, e.g., a dissimilarity measured using the Kullback-Leibler divergence.
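
A minimal NumPy sketch of this distance follows. Two things are assumptions of the sketch rather than statements from the answer: the norm is taken to be the Frobenius norm (which makes the ratio a cosine similarity between the flattened matrices), and the correlation matrices are computed between samples, so that both are $n \times n$ even though the two feature sets differ in size. The function name `correlation_matrix_distance` and the data are made up for illustration.

```python
import numpy as np


def correlation_matrix_distance(R1, R2):
    """d(R1, R2) = 1 - tr(R1 R2) / (||R1|| ||R2||), with Frobenius norms (assumed)."""
    num = np.trace(R1 @ R2)
    den = np.linalg.norm(R1, "fro") * np.linalg.norm(R2, "fro")
    return 1.0 - num / den


# Synthetic demo (made-up data): the same 100 samples described by two
# feature sets of different size, both driven by the same latent factors.
rng = np.random.default_rng(1)
Z = rng.normal(size=(100, 5))
X = Z @ rng.normal(size=(5, 1000)) + rng.normal(size=(100, 1000))
Y = Z @ rng.normal(size=(5, 800)) + rng.normal(size=(100, 800))

# np.corrcoef treats rows as variables, so these are sample-by-sample
# Pearson correlation matrices of shape (100, 100).
R1 = np.corrcoef(X)
R2 = np.corrcoef(Y)

d = correlation_matrix_distance(R1, R2)
print(d)
```

For rank-based coefficients such as Spearman, the rows could be converted to ranks before calling `np.corrcoef`.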

Roger Vadim
  • Great ricochet, $\cos(a,b)=\frac{a^Tb}{||a|| \cdot ||b||}$, which is a normalization of the Euclidean distance. – Good Luck Mar 22 '21 at 13:33
  • @GoodLuck I just wanted to reproduce the equation as it is cited in the post referenced. *Generalized* here refers to the fact that it is applied to matrices rather than vectors. Btw, I liked your answer and will look deeper into it. – Roger Vadim Mar 22 '21 at 13:35