9

Method "A" describes biological samples using multivariate "fingerprints" that consist of about 30 different variables. Different variables show different typical distribution and many of them closely correlate one with another. From prior experience it is assumed that we cannot transform many of the variables to normal distribution.

Method "B" is designed to be an improved version of method "A" and we wish to compare the repeatabilty of these two methods. If we were dealing with single variable, we would perform independent analyses of several samples and use ANOVA in order to compare within-method to between-methods variability. But here we are dealing with multivariate outputs and we do not wish to perform one analysis per variable. What are the correct approaches to this question?

Resolution

gui11aume's answer provides useful and valuable information. I will adapt the "downstream application" approach from gui11aume's answer, followed by 7 one-way analyses as suggested by AdamO.

David D
  • 717
  • 5
  • 13
  • (Here is my approach. Please let me know how legitimate it is.) What about using a robust dimensionality reduction method to reduce the multivariate data to a single dimension and analyzing it? – David D Jun 11 '12 at 14:41
  • 1
    David, this problem sounds like you want to do a variance decomposition on a multivariate outcome but the title seems to indicate you're after something else. Can you clarify? Also, can you say anything more about the data you're analyzing? – Macro Jun 15 '12 at 15:55
  • David, can you explain more explicitly what you mean by "repeatability"? I suspect it is similar to what we (my field is chemometric analysis of spectroscopic data sets [biological samples]) usually call stability (of sth. wrt. sth.), e.g.: stability of predictions or of model parameters (two very distinct types of stability!) wrt. new samples / exchanging 10% of the samples, ... – cbeleites unhappy with SX Jun 18 '12 at 12:56
  • 1
    Also, are the 30 output variables the same (theoretically) for both methods? – cbeleites unhappy with SX Jun 18 '12 at 13:09
  • 1
    Wrt. your dimensionality reduction: you'd run the risk of measuring the characteristics of the dimensionality reduction method more than those of its input. Certainly you'll lose any information that is orthogonal to the direction captured by the one retained dimension. – cbeleites unhappy with SX Jun 18 '12 at 13:12

5 Answers

7

This reminds me of cancer diagnostics, where old gene expression signatures are replaced by newer ones that are, of course, supposed to be better. But how do you show that they are better?

Here are a couple of suggestions to compare the repeatability of the methods.

1. Use co-inertia analysis (CIA).
CIA deserves to be better known; unfortunately it is not widely used (it has no Wikipedia page, for example). CIA is a two-table method that works on the same principle as canonical analysis (CA), which is to look for a pair of linear scores with maximum correlation between two sets of multidimensional measurements. Its advantage over CA is that you can use it even if you have more dimensions than observations. You could measure both methods on the same samples to get two coupled tables of 30 columns and $n$ observations. The first pair of principal components should be strongly correlated (if the methods really measure the same thing). If method B is better, its residual variance should be smaller than the residual variance of method A. With this approach you address both the agreement of the methods and their disagreement, which you interpret as noise.
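For illustration, here is a minimal sketch of how CIA could be run in R with the ade4 package; `XA` and `XB` are hypothetical $n \times 30$ matrices holding the two methods' fingerprints of the same samples:

```r
library(ade4)

# One PCA ("dudi") per table; scannf = FALSE suppresses the interactive scree plot
dudiA <- dudi.pca(as.data.frame(XA), scannf = FALSE, nf = 2)
dudiB <- dudi.pca(as.data.frame(XB), scannf = FALSE, nf = 2)

# Co-inertia of the two tables (requires identical row weights,
# which dudi.pca gives by default)
coin <- coinertia(dudiA, dudiB, scannf = FALSE, nf = 2)
coin$RV                       # RV coefficient: global co-structure of the two tables
randtest(coin, nrepet = 999)  # permutation test of that co-structure
```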

2. Use a distance.
You could use the Euclidean distance in 30 dimensions between the test and the retest to measure the repeatability of a method. This gives you a sample of that score for each method, and you can compare the two samples with the Wilcoxon test.
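A minimal sketch of this approach, assuming each method measured the same $n$ samples twice, giving hypothetical $n \times 30$ matrices `testA`, `retestA`, `testB`, `retestB`:

```r
# If the 30 variables live on very different scales, standardize them first,
# otherwise one variable can dominate the Euclidean distance.
repeatA <- sqrt(rowSums((testA - retestA)^2))  # per-sample test-retest distance, method A
repeatB <- sqrt(rowSums((testB - retestB)^2))  # same for method B

# Smaller distances mean better repeatability; paired because the same
# samples were run through both methods (drop paired = TRUE otherwise).
wilcox.test(repeatA, repeatB, paired = TRUE)
```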

3. Use downstream application.
You are probably generating these fingerprints in order to make a decision, or to classify patients or biological material. You can count the agreements vs. disagreements between tests and retests for both methods and compare them with the Wilcoxon test.
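A sketch of this idea, reusing the hypothetical matrices above and a hypothetical decision rule `classify()` standing in for whatever downstream decision you actually make. McNemar's test is shown as one common choice for paired binary agreement; the Wilcoxon test suggested above can be applied to the same indicators:

```r
# Agreement between test and retest is one binary outcome per sample and method
agreeA <- classify(testA) == classify(retestA)
agreeB <- classify(testB) == classify(retestB)

# Paired comparison of the two agreement rates on the same samples
mcnemar.test(table(agreeA, agreeB))
```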

Method 3 is the simplest, but also the most down-to-earth. Even for high-dimensional inputs, decisions are usually quite simple. And however complex our problem is, bear in mind that statistics is the science of decision.

Regarding the question in your comment.

What about using a robust dimensionality reduction method to reduce the multivariate data to a single dimension and analyzing it?

Dimensionality reduction, however robust, will be associated with a loss of variance. If there is a way to transform your multivariate fingerprint into a single score capturing almost all of its variance, then sure, this is by far the best thing to do. But then why is the fingerprint multivariate in the first place?

I assumed from the context of the OP that the fingerprint is multivariate precisely because it is hard to reduce its dimensionality further without losing information. In that case, repeatability on a single score need not be a good proxy for overall repeatability, because you may neglect the majority of the variance (close to 29/30 of it in the worst case).

gui11aume
  • 13,383
  • 2
  • 44
  • 89
  • 1. You are almost right about the application of this test. 2. Regarding the Mahalanobis distance, I don't understand how it can be used to assess repeatability. Do you suggest computing the covariance matrix for all the points in all the methods TOGETHER and then comparing the methods by sampling MD using that matrix? 3. Downstream application is indeed a valuable option, however it will not reduce the dimensionality t – David D Jun 18 '12 at 06:03
  • Regarding point 2. you are right that it is difficult to apply the Mahalanobis distance. I removed it from the answer. – gui11aume Jun 18 '12 at 09:32
  • @gui11aume: the multivariate input may be multivariate because it is raw measured data, i.e. variates = measurement channels (of a sensor array, spectrometer, ...). In this case, the multivariate nature comes from the nature of the measurement (though from another point of view usually a certain dimension reduction is already applied in the form of selecting *this* sensor chip or *this* particular spectral range) – cbeleites unhappy with SX Jun 18 '12 at 12:58
  • @gui11aume: I also use your 3rd approach to compare classifiers. But: I read from the question and the comment about dimensionality reduction that this downstream application (which in fact *is* a drastic dimensionality reduction) is probably not available (or at least the 30 variates themselves should be compared). – cbeleites unhappy with SX Jun 18 '12 at 14:22
  • @gui11aume: a distance measures similarity, but IMHO you also need to check the direction of the deviations, which is lost by the distance. – cbeleites unhappy with SX Jun 18 '12 at 22:38
  • Thank you for the valuable insight, the interesting reference to CIA, and for being the first to answer this question. – Boris Gorelik Jun 21 '12 at 17:36
3

I assume from your question and comment that the 30 output variables cannot (easily) or should not be transformed to a single variate.

One way to deal with data of the form $\mathbf{X_A}^{(n \times p_A)} \leftrightarrow \mathbf{X_B}^{(n \times p_B)}$ is to regress $\mathbf{X_A} \mapsto \mathbf{X_B}$ and vice versa. Additional knowledge (e.g. that variate $i$ in set A corresponds to variate $i$ in set B) can help to restrict the mapping model and/or aid the interpretation.
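A sketch of the regression idea: with a matrix response, R's `lm()` fits one least-squares regression per column, so the whole mapping can be estimated in a single call (this needs $n > p_A$; with fewer samples than variates, a regularized mapping such as PLS would be needed instead). `XA` and `XB` are hypothetical matrices with row-wise sample correspondence:

```r
fitAB <- lm(XB ~ XA)                      # one regression per column of XB
resid_ms <- colMeans(residuals(fitAB)^2)  # per-variate residual variance of the mapping
resid_ms                                  # small values: variate well explained by XA
```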

So what about multi-block PCA (or multi-block PLS), which takes this idea further? In these methods, both multivariate fingerprints for the same samples (or same individuals) are analyzed together as independent variables, with or without a third, dependent block.

R. Brereton's "Chemometrics for Pattern Recognition" discusses some techniques in its last chapter ("Comparing Different Patterns"), and googling will lead you to a number of papers and introductions. Note that your situation sounds similar to problems where e.g. spectroscopic and genetic measurements are analyzed together (two matrices with a row-wise correspondence, as opposed to analyzing e.g. time series of spectra, where a data cube is analyzed).

Here's a paper dealing with multi-block analysis: Sahar Hassani, "Analysis of -omics data: Graphical interpretation and validation tools in multi-block methods".

Also, maybe this is a good starting point in another direction: Hoefsloot et al., "Multiset Data Analysis: ANOVA Simultaneous Component Analysis and Related Methods", in: Comprehensive Chemometrics — Chemical and Biochemical Data Analysis. (I don't have access to it; I just saw the abstract.)

cbeleites unhappy with SX
  • 34,156
  • 3
  • 67
  • 133
1

30 one-way analyses are certainly an option and would make an ideal "table 2" type of analysis, in which overall performance is summarized in a logical way. It may be the case that method B produces the first 20 factors with slightly improved precision whereas the last 10 are wildly more variable. You then face the issue of inference over a partially ordered space: certainly, if all 30 factors are more precise in B, then B is the better method. But there is a "grey" area, and with this large a number of factors it is almost guaranteed to show up in practice.
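A sketch of what the per-variable analyses could look like, assuming a hypothetical long-format data frame `dat` with columns `sample` (factor), `method` ("A"/"B"), and measurements `v1` ... `v30`, with several replicate rows per sample:

```r
within_ms <- function(y, s) {
  # Residual mean square of a one-way ANOVA = within-sample variance,
  # i.e. the repeatability of that variable
  anova(lm(y ~ s))["Residuals", "Mean Sq"]
}

vars <- paste0("v", 1:30)
res <- t(sapply(vars, function(v) {
  c(A = within_ms(dat[dat$method == "A", v], dat$sample[dat$method == "A"]),
    B = within_ms(dat[dat$method == "B", v], dat$sample[dat$method == "B"]))
}))
res  # 30 x 2 table; smaller values mean better repeatability
```

A table like this makes the "grey" area explicit: each row where B beats A (or vice versa) is visible at a glance.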

If the objective of this research is to land on a single analysis, it is important to consider the weight of each outcome and its endpoint application. If these 30 variables are used in classification, prediction, and/or clustering of observational data, then I would like to see validation of these results and a comparison of A vs. B in classification (using something like risk stratification tables or mean percent bias), prediction (using the MSE), and clustering (using something like cross-validation). This is the proper way of handling the grey area in which you cannot say B is better analytically, but it works much better in practice.

AdamO
  • 52,330
  • 5
  • 104
  • 209
1

I would try a permutation-based multivariate ANOVA (PERMANOVA) approach. An ordination analysis (chosen based on the result of a gradient length analysis) could also help.
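A sketch with the vegan package (the comment below names `adonis`; `adonis2` is its current form). The setup is hypothetical: `Y` is the stacked $2n \times 30$ matrix of fingerprints from both methods and `meta$method` labels its rows:

```r
library(vegan)

meta <- data.frame(method = factor(rep(c("A", "B"), each = nrow(Y) / 2)))
adonis2(Y ~ method, data = meta, method = "euclidean", permutations = 999)

# PERMANOVA tests for a difference in location; comparing repeatability is
# closer to comparing multivariate dispersions, which betadisper() addresses:
permutest(betadisper(dist(Y), group = meta$method))
```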

AnastD
  • 310
  • 2
  • 9
  • 1
    In R there is the function adonis in the package vegan that performs permutational multivariate ANOVA. This will generate a statistical test to tell you whether method A is different from method B. This package comes out of plant ecology, where you count multiple species (the variables) in different small plots. Related to this is AMOVA, [analysis of molecular variance](http://en.wikipedia.org/wiki/Analysis_of_molecular_variance), where the variables are molecular data. For this you can use the R package ade4, but there is other free and online software you can find at the link. – Jdub Jun 18 '12 at 14:41
0

If you could assume multivariate normality (which you said you cannot), you could do a Hotelling $T^2$ test of equality of mean vectors to see whether you could claim a difference between the distributions. Even without that assumption, you can in principle still compare the distributions to see if they differ much: divide the 30-dimensional space into rectangular grids, use these as 30-dimensional bins, count the number of vectors falling into each bin, and apply a chi-square test to see if the distributions look the same.

The problem with this suggestion is that it requires judiciously selecting the bins in order to cover the data points in an appropriate way. Also, the curse of dimensionality makes it difficult to identify differences between multivariate distributions without a very large number of points in each group.

I think the suggestions gui11aume gave are sensible; I don't think the others are. Since comparing the distributions is not feasible in 30 dimensions with a typical sample, some form of valid comparison of the mean vectors would seem to me to be appropriate.

Michael R. Chernick
  • 39,640
  • 28
  • 74
  • 143
  • 1
    Hi, Michael. Do you mind clarifying what you're suggesting regarding binning? It *sounds* like you're suggesting binning each dimension separately and then classifying into bins. But, let's say we have *two* bins per dimension, that's $2^{30} > 10^9$ bins. That doesn't sound like a good candidate for a $\chi^2$ test. So, what *are* you suggesting? – cardinal Jun 15 '12 at 23:19
  • Also, according to your suggestion, it is not clear how the binning should be done: should every bin have the same number of cases, the same range, the same log range, etc.? – Boris Gorelik Jun 16 '12 at 12:49
  • @cardinal No, what I said was to construct 30-dimensional rectangular-shaped bins. I do the usual chi-square test for comparing two distributions. – Michael R. Chernick Jun 16 '12 at 20:31
  • @bgbg It is always a judgement call as to how many bins to have and whether or not they should have equal size. I think one tries to keep the bin sizes the same, and the number of bins should be chosen to reasonably represent the shape of the distribution. Too few bins hide the shape, while too many create sparse and empty bins. – Michael R. Chernick Jun 16 '12 at 20:35
  • @MichaelChernick: Please provide some detail in your answer on how you propose to do this to yield something that would lead to a valid $\chi^2$ test. (What I describe *does* yield rectangular bins of the simplest form such that no dimension of the data is completely ignored.) – cardinal Jun 16 '12 at 20:42
  • I imagine that there may be a need for a large number of bins but they can be concentrated in the region where the data fall. For the test we just need several high dimensional bins and not 2 for each dimension. – Michael R. Chernick Jun 17 '12 at 03:24
  • (-1) It is with considerable reluctance that I downvote, but I don't believe the answer as currently written sufficiently responds to the OP's question. A naive application of a $\chi^2$ test with data-driven bin selection is invalid and there are rather severe difficulties in the first place given the high dimensionality (even the simplest of schemes would require a sample size at least an order of magnitude or two larger than the Earth's human population). I am happy to remove the downvote upon further revision. Cheers. – cardinal Jun 17 '12 at 04:09
  • 2
    After giving this more thought I think that my recommendation would not work in high dimensions because (1) although a judicious choice of bins is practical in 1, 2 and possibly 3 dimensions, it does not seem to me that identifying such bins in 30 dimensions could be done (2) because of the curse of dimensionality even if such a selection could be achieved points in 30 dimensions spread out in such a way that it would be difficult to detect differences between the distributions without a very large number of points. So cardinal makes some good points. – Michael R. Chernick Jun 17 '12 at 11:49
  • However I do not agree with the statement above "A naive application of a χ2 test with data-driven bin selection is invalid". A data driven approach to bin selection for a chi-square test is not invalid. The difficulty is in achieving it and not its validity once achieved. So I will modify or delete my answer. – Michael R. Chernick Jun 17 '12 at 11:51
  • 1
    I should have been more specific; by "naive application", I meant that one cannot simply apply the standard test immediately. At the very least some adjustment for degrees of freedom must be made, though sometimes determining what the degrees of freedom should be is not a completely straightforward matter. – cardinal Jun 17 '12 at 13:14
  • Downvote removed. – cardinal Jun 17 '12 at 13:15