Mann-Whitney U test when only summary data (mean, sd, sample size) are available

Question

When raw data are not available, as often happens when the figures come from published scientific papers, how can one perform a Mann-Whitney test of whether two samples differ when only the means, standard deviations and sample sizes are available?

I would like to do this in R Commander.

For example:

Sample 1: mean1=2.5; sd1=0.25; n1=4

Sample 2: mean2=3.1; sd2=0.33; n2=8

Are they statistically different when a M-W is performed?

You can't. The test uses ranks and you don't have any information about those. The data is only sufficient for doing a t-test. — Roland, Dec 10 '14 at 08:59

score 3 · Answer 1 · edited Apr 13 '17 at 12:44

As already noted in the comments, you can't perform a Wilcoxon-Mann-Whitney U test on such summary data, as you don't know what the ranks are. I can show you two data sets that both have the summary statistics you specified, but which give very different results under a U test. This should demonstrate beyond doubt the impossibility of your requirement.

Firstly, {2.132241, 2.581832, 2.691334, 2.594593} and {2.572249, 3.221167, 3.310597, 3.288375, 3.147697, 2.580609, 3.305366, 3.373940}. These have the required means, standard deviations and sample sizes. The U test would have a p-value of 0.1091.

Secondly, {2.133137, 2.610204, 2.690900, 2.565759} and {2.911540, 3.533921, 2.897434, 3.453944, 2.744262, 3.441276, 2.741388, 3.076235}. The U test would have a p-value of 0.00404.

In other words, the provided summary statistics (which concern the moments of the data) are consistent with data sets that have very different rank structures. So while you could perform a t-test from the given information, you can't perform a U test.
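For completeness, the t-test that the summary statistics do permit can be computed directly from the means, standard deviations and sample sizes. What follows is a minimal sketch, not the only way to do it: `welch_from_summary` is a hypothetical helper name (it is not a base R function), and it assumes Welch's unequal-variance form of the test. A pooled-variance test could be computed from the same three numbers per group.

```r
# Hypothetical helper (not part of base R): Welch's two-sample t-test
# computed from summary statistics only.
welch_from_summary <- function(m1, s1, n1, m2, s2, n2) {
  v1 <- s1^2 / n1                      # squared standard error, group 1
  v2 <- s2^2 / n2                      # squared standard error, group 2
  t_stat <- (m1 - m2) / sqrt(v1 + v2)  # Welch t statistic
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
  list(t = t_stat, df = df, p.value = p)
}

# Summary statistics from the question
res <- welch_from_summary(2.5, 0.25, 4, 3.1, 0.33, 8)
print(res)
```

For the figures in the question this gives a t statistic of about -3.51 on roughly 7.9 degrees of freedom. No such shortcut exists for the U statistic, because it depends on the ranks rather than on the moments.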

Appendix

If you are curious as to where I procured my "data", they were constructed from normal random deviates and rescaled to have the mean and standard deviation that you specified.

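# Draw standard normal deviates, then rescale each sample so that its
# mean and sd match the published summary statistics exactly
set.seed(33504)
x1 <- rnorm(n=4)
x1 <- 0.25*(x1-mean(x1))/sd(x1) + 2.5   # sample 1: mean 2.5, sd 0.25
x2 <- rnorm(n=8)
x2 <- 0.33*(x2-mean(x2))/sd(x2) + 3.1   # sample 2: mean 3.1, sd 0.33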

These data give:

>     x1
[1] 2.132241 2.581832 2.691334 2.594593
>     mean(x1)
[1] 2.5
>     sd(x1)
[1] 0.25
>     x2
[1] 2.572249 3.221167 3.310597 3.288375 3.147697 2.580609 3.305366 3.373940
>     mean(x2)
[1] 3.1
>     sd(x2)
[1] 0.33
>     wilcox.test(x1, x2)

        Wilcoxon rank sum test

data:  x1 and x2 
W = 6, p-value = 0.1091
alternative hypothesis: true location shift is not equal to 0

But with a different seed:

set.seed(14)
x1 <- rnorm(n=4)
x1 <- 0.25*(x1-mean(x1))/sd(x1) + 2.5
x2 <- rnorm(n=8)
x2 <- 0.33*(x2-mean(x2))/sd(x2) + 3.1

Different data, same means and standard deviations, and a different result!

>     x1
[1] 2.133137 2.610204 2.690900 2.565759
>     mean(x1)
[1] 2.5
>     sd(x1)
[1] 0.25
>     x2
[1] 2.911540 3.533921 2.897434 3.453944 2.744262 3.441276 2.741388 3.076235
>     mean(x2)
[1] 3.1
>     sd(x2)
[1] 0.33
> 
>     wilcox.test(x1, x2)

        Wilcoxon rank sum test

data:  x1 and x2 
W = 0, p-value = 0.00404
alternative hypothesis: true location shift is not equal to 0

However, if you run t.test(x1, x2) on each pair of samples you will get the same result both times, since the given summary statistics, which the "data" have been designed to match, are sufficient to calculate the t statistic.
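The comparison can be run side by side without relying on the seeds, by typing in the values printed above. This is only a sketch: the names `a1`, `a2`, `b1` and `b2` are introduced here for illustration, and because the values are rounded to six decimals the two t statistics agree to within rounding rather than exactly.

```r
# The two illustrative data sets, copied from the printed output above
a1 <- c(2.132241, 2.581832, 2.691334, 2.594593)
a2 <- c(2.572249, 3.221167, 3.310597, 3.288375,
        3.147697, 2.580609, 3.305366, 3.373940)
b1 <- c(2.133137, 2.610204, 2.690900, 2.565759)
b2 <- c(2.911540, 3.533921, 2.897434, 3.453944,
        2.744262, 3.441276, 2.741388, 3.076235)

# Welch t-tests (the t.test default) and Wilcoxon rank sum tests
t_a <- t.test(a1, a2)
t_b <- t.test(b1, b2)
w_a <- wilcox.test(a1, a2)
w_b <- wilcox.test(b1, b2)

# The t statistics agree up to rounding of the inputs ...
print(c(t_a = unname(t_a$statistic), t_b = unname(t_b$statistic)))
# ... while the rank sum statistics and their p-values do not
print(c(W_a = unname(w_a$statistic), W_b = unname(w_b$statistic)))
print(c(p_a = w_a$p.value, p_b = w_b$p.value))
```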

(1) If you're using a different seed, the data are *different*, right? (2) Since your simulated data clearly are not drawn iid from any population (due to the post-processing), it is not evident that either the Wilcoxon or a t test would apply to them. The underlying idea is nevertheless correct: these moment-based statistics are consistent with radically different Wilcoxon test statistics. But trying to make that case in the way you do here introduces so many extraneous elements that it does not adequately convey the insight you have. — whuber, Dec 15 '14 at 15:29
@whuber I think your pedagogical point is correct so I have reordered the material. The main thrust is simply that the summary statistics are perfectly consistent with data sets with different ranking structure, and this doesn't rely on the simulation at all, but showing possible numbers should make the result more evident. If I had pulled those numbers from thin air, and asked "what if your paper's data looked like *this*? But then again, what if it looked like *this*?" that would have proved the point. But I felt I shouldn't leave the mystery of where my "data" came from. — Silverfish, Dec 15 '14 at 15:58
Thanks for addressing this (+1). Another way to approach the problem is to use small, simple datasets to illustrate. All you need to create enormous changes in the moments is modify one of the most extreme values in the combined dataset in a way that does not change the Wilcoxon statistic. By using simple numbers in very small amounts you can help the reader focus on the underlying idea without being distracted by inessential details. — whuber, Dec 15 '14 at 16:02
@whuber Yes, if the OP hadn't clearly had a particular data set in mind, I would have opted for something simpler! But I'm not sure I follow your method: wouldn't you end up with two data sets giving the "same U, different t" while what we are trying to demonstrate to the OP is rather that there could be data sets with "same t, different U"? — Silverfish, Dec 15 '14 at 16:16
@whuber My preferred solution strategy to construct such a small data set might be to set it up with "almost ties" between the two groups, so that a small shift in the first group can change several ranks. This relates to a second point I considered addressing, but didn't, which is that the published summary statistics are only accurate up to rounding. A mean of 2.5 could be from 2.45 to 2.55. Hence, a sufficiently small translation would still be consistent with the published figures, but could change the rank statistics and hence the U result. — Silverfish, Dec 15 '14 at 16:20
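
The rounding point in the last comment can be illustrated with the first data set from the answer. This is a minimal sketch under two assumptions that are not in the original post: that the published mean of 2.5 was rounded to one decimal place, and an arbitrarily chosen shift of -0.04 applied to the first sample.

```r
# First illustrative data set from the answer
x1 <- c(2.132241, 2.581832, 2.691334, 2.594593)
x2 <- c(2.572249, 3.221167, 3.310597, 3.288375,
        3.147697, 2.580609, 3.305366, 3.373940)

# Assumed shift, small enough that the mean still rounds to 2.5
shift <- -0.04
x1_shifted <- x1 + shift

# Reported summary statistics are unchanged after rounding
print(round(mean(x1_shifted), 1))  # mean to one decimal place
print(round(sd(x1_shifted), 2))    # a shift does not alter the sd

# The rank structure, and hence the U test, does change
w_orig  <- wilcox.test(x1, x2)
w_shift <- wilcox.test(x1_shifted, x2)
print(c(W_orig = unname(w_orig$statistic), W_shift = unname(w_shift$statistic)))
print(c(p_orig = w_orig$p.value, p_shift = w_shift$p.value))
```

Here W drops from 6 to 2, because two of the shifted values in the first sample fall below the two smallest values in the second sample, even though the rounded mean and standard deviation are unchanged.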
