I would like to reproduce the results of the Wilcoxon-Mann-Whitney test in R, wilcox.test(x, y, paired=FALSE, conf.int=TRUE).

I managed to reproduce the W value, but not the difference in location.

Help says:

Note that in the two-sample case the estimator for the difference in location parameters does not estimate the difference in medians (a common misconception) but rather the median of the difference between a sample from x and a sample from y.

Honestly, I don't understand what "a sample from x" means. Does it mean that the location difference was simulated using samples from x and y? And what is the difference between a sample from x and a sample from y? That is, what is the difference between two vectors?

I prepared an example:

# -- Create data
A <- c(7,14,22,36,40,48,49,52)
n1 <- length(A)
B <- c(3,5,6,10,17,18,20,39)
n2 <- length(B)

# -- Do some processing
All <- c(A,B)
grp <- c(rep(1,n1), rep(2,n2))
rnk <-rank(All)
xdata <- matrix(c(grp,All,rnk), ncol=3)
names <- c("group","value","rank")
dimnames(xdata) <- list(NULL,names)
t(xdata)

Here is the new data structure:

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16]
  group    1    1    1    1    1    1    1    1    2     2     2     2     2     2     2     2
  value    7   14   22   36   40   48   49   52    3     5     6    10    17    18    20    39
  rank     4    6   10   11   13   14   15   16    1     2     3     5     7     8     9    12

Computing statistics:

data <- as.data.frame(xdata)

# -- sum of ranks
(r1 <- with(data,sum(rank[group==1])))
(r2 <- with(data,sum(rank[group==2])))

# -- statistics
(u1 <- r2-n2*(n2+1)/2)  # u1 = 11, U statistic of group 2 (B)
(u2 <- r1-n1*(n1+1)/2)  # u2 = 53, U statistic of group 1 (A)

# -- Test
wilcox.test(A,B, paired=FALSE, conf.int = TRUE)

Output of test:

Wilcoxon rank sum test

data:  A and B
W = 53, p-value = 0.02813
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
  2 35
sample estimates:
difference in location 
                  19.5  

W is stored in the test result:

x <- wilcox.test(A,B, paired=FALSE, conf.int = TRUE)
x$statistic

The result W = 53 I can get manually from

max(c(u1,u2))  # max of 53 and 11
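
A caveat on this step, as a small sketch: as far as I understand the documentation, the W reported by wilcox.test(x, y) is the rank sum of the first sample minus n1*(n1+1)/2, not the larger of the two U values. It only coincides with the maximum here because A is the first argument and tends to be larger. Without ties, it also equals the number of (A, B) pairs in which the A value exceeds the B value. The names w_manual, w_pairs, w_AB and w_BA are just for this illustration.

```r
A <- c(7, 14, 22, 36, 40, 48, 49, 52)
B <- c(3, 5, 6, 10, 17, 18, 20, 39)
n1 <- length(A)

# Rank sum of the first sample minus its minimum possible value
rnk <- rank(c(A, B))
w_manual <- sum(rnk[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Equivalent pair count (valid here because there are no ties)
w_pairs <- sum(outer(A, B, ">"))

# Swapping the arguments gives the other U value instead of the maximum
w_AB <- unname(wilcox.test(A, B)$statistic)
w_BA <- unname(wilcox.test(B, A)$statistic)

c(w_manual = w_manual, w_pairs = w_pairs, w_AB = w_AB, w_BA = w_BA)
```

The two statistics always add up to n1*n2 (here 53 + 11 = 64), so either one determines the other.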

I'm just wondering how I can get

x$estimate

difference in location = 19.5 from the data above.

mdewey
giordano
    Please search through Q/A of this site about Mann-Whitney test. Pay attention to such things as "difference in mean ranks" and "Hodges-Lehmann 2-sample estimator". – ttnphns Feb 25 '16 at 19:27

1 Answer

Thanks to ttnphns I found the solution: the estimate is the Hodges-Lehmann two-sample estimator, that is, the median of all pairwise differences between a value of A and a value of B. Here is the code for my case:

> names(A) <- A # not necessary 
> names(B) <- B # not necessary 

# All pairwise differences A[i] - B[j] (rows are values of A, columns are values of B)
> outer(A,B,"-")
    3  5  6 10  17  18  20  39
7   4  2  1 -3 -10 -11 -13 -32
14 11  9  8  4  -3  -4  -6 -25
22 19 17 16 12   5   4   2 -17
36 33 31 30 26  19  18  16  -3
40 37 35 34 30  23  22  20   1
48 45 43 42 38  31  30  28   9
49 46 44 43 39  32  31  29  10
52 49 47 46 42  35  34  32  13

# median of the matrix
> median(outer(A,B,"-"))
[1] 19.5
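
The same computation can be wrapped in a small helper and checked against wilcox.test directly. This is only a sketch: hl_shift is a name made up here, not a function from R. For small samples without ties, like this one, the two values agree. For larger samples or data with ties, wilcox.test switches to a normal approximation, and I would expect its estimate to differ slightly from the plain median of the differences.

```r
A <- c(7, 14, 22, 36, 40, 48, 49, 52)
B <- c(3, 5, 6, 10, 17, 18, 20, 39)

# Hypothetical helper: median of all pairwise differences x[i] - y[j]
hl_shift <- function(x, y) median(outer(x, y, "-"))

est_manual <- hl_shift(A, B)
est_r <- unname(wilcox.test(A, B, paired = FALSE, conf.int = TRUE)$estimate)

# TRUE when the manual value matches the one reported by wilcox.test
isTRUE(all.equal(est_manual, est_r))
```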

The difference of the two medians is:

> median(A) - median(B)
[1] 24.5

Quite a large difference.

How can I explain to a non-statistician why one should use the two-sample Hodges-Lehmann estimator for the difference between A and B rather than the difference between the median of A and the median of B?
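
One way to make this concrete, as a rough sketch rather than a formal argument: the Hodges-Lehmann estimate is the amount you would have to subtract from every value in A so that the rank test no longer sees any shift, i.e. so that A is larger than B in exactly half of the n1*n2 pairs. The difference of the medians only looks at the middle values of each group and does not have this balancing property. The names hl, md, w_hl and w_md are chosen only for this illustration.

```r
A <- c(7, 14, 22, 36, 40, 48, 49, 52)
B <- c(3, 5, 6, 10, 17, 18, 20, 39)

hl <- median(outer(A, B, "-"))  # Hodges-Lehmann estimate
md <- median(A) - median(B)     # difference of the two medians

# Shift A down by each candidate and recompute W,
# which counts the pairs in which the shifted A value exceeds the B value
w_hl <- unname(wilcox.test(A - hl, B)$statistic)
w_md <- unname(wilcox.test(A - md, B)$statistic)

c(balance_point = length(A) * length(B) / 2, w_hl = w_hl, w_md = w_md)
```

After shifting by hl, W sits exactly at the balance point n1*n2/2 = 32. Shifting by md pushes W below that point, so by the test's own yardstick the difference of medians overcorrects for these data.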

giordano