How do I interpret the Mann-Whitney U when using R's formula interface?

Question

Say we have the following data:

set.seed(123)
data <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                   y = c(rep('A', 50),    rep('B', 50)))

Which yields the following boxplot (boxplot(data$x ~ data$y)):

[Figure: boxplot of x for groups A and B]

Now let's say I want to test if the two samples have the same location parameters (median and/or mean). In my real case, the data are clearly not normal, so I've decided to run the Wilcoxon-Mann-Whitney test, like this:

wilcox.test(data$x ~ data$y)

However, I would like the alternative hypothesis to be that B, data$y's "second" factor, comes from a distribution with higher position parameters. I've tried setting the alternative parameter to "greater" and "less", but apparently the alternative hypotheses are not what I'm looking for. For example, alternative = "greater" tells me "alternative hypothesis: true location shift is greater than 0"; alternative = "less" tells me "alternative hypothesis: true location shift is less than 0".

How can I tweak the wilcox.test() function in order to have the alternative hypothesis I want (B comes from a distribution with higher position parameters than A)? Or should I just use another test instead?

In what sense aren't your data normal? Based on the boxplots (possibly not the best way to decide, but what's there) they certainly look normal enough. Moreover, you *generated* your data w/ `rnorm()`, so **they have to be normal**. I wonder if you're confused about the nature of the assumption of normality; it may help you to read this thread: [What if residuals are normally distributed but y is not](http://stats.stackexchange.com/questions/12262/). — gung - Reinstate Monica, Jul 25 '13 at 14:49
I am just expanding on @Roland's point but why do you think there is a problem? It seems to give you exactly what you want. — Gala, Jul 25 '13 at 14:52
@gung I think the phrase “in my real case” implies that the OP is in fact interested in another data set than the one he created for the question. — Gala, Jul 25 '13 at 14:54
Hmmm, I must have missed that, @GaëlLaurans. Thanks for pointing that out. — gung - Reinstate Monica, Jul 25 '13 at 15:00
Sorry, I was just too much in a hurry to think of a way of creating asymmetric mock data that would look more like my real case. Feel free to edit the post and substitute those `rnorm()`s for something less well-behaved. — Waldir Leoncio, Jul 25 '13 at 15:07
What do you understand to be the difference between "comes from a distribution with higher position parameters" & "true location shift is greater than 0"? — Scortchi - Reinstate Monica, Jul 25 '13 at 16:19
Honestly, @Scortchi, I understand "true location shift is greater than 0" as E(X|Y = A) <> E(X|Y = B), but not as that difference having a direction. But since the other alternative would be "alternative hypothesis: true location shift is less than 0", I guess R is already implying a direction; I'm just failing to see which direction it is. — Waldir Leoncio, Jul 25 '13 at 17:33
@Roland, I'm interpreting "location shift" as the difference between E(X|Y = A) and E(X|Y = B), is that correct? — Waldir Leoncio, Jul 25 '13 at 17:34
The Wilcoxon-Mann-Whitney test is sensitive to more general kinds of difference than a straight location shift; for example, with positive values, it's equally sensitive to a scale shift (taking logs converts the scale shift to a location shift, but the WMW statistic is the same). You can even treat a one-sided alternative as general as $P(X>Y)>\frac{1}{2}$, for example (e.g. see Conover's *Practical Nonparametric Statistics*). — Glen_b, Jul 25 '13 at 23:14
(ctd)... On the other hand, you said at one point "*I want to test if the two samples come from the same distribution*"; since there are more ways for that to be false than a tendency for one variable to be higher (e.g. a shift in variability with similar locations or a change in skewness or in peakedness), if you really just want to test for equality of distributions vs inequality of them you should probably consider a two-sample Kolmogorov-Smirnov test. If you are interested in a 'tends to be greater' alternative, then WMW should be okay. — Glen_b, Jul 25 '13 at 23:16
Thank you for your valuable input, Glen. I will edit my OP, since what I really want is to compare the distributions' position parameters, not whether they come from the same family of distributions. — Waldir Leoncio, Jul 26 '13 at 16:04

Gala · Accepted Answer · 2013-07-26T15:11:11.743

Technically, the reference category and the direction of the test depend on the way the factor variable is encoded. With your toy data:

> wilcox.test(x ~ y, data=data, alternative="greater")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 52, p-value = 1
alternative hypothesis: true location shift is greater than 0 

> wilcox.test(x ~ y, data=data, alternative="less")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 52, p-value < 2.2e-16
alternative hypothesis: true location shift is less than 0

Notice that the W statistic is the same in both cases but the test uses opposite tails of its sampling distribution. Now let's look at the factor variable:

> levels(data$y)
[1] "A" "B"
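
One caveat if you run this on a recent R: since R 4.0.0, data.frame() keeps strings as character vectors by default, so on the toy data levels(data$y) returns NULL rather than the output above. As far as I can tell, the formula method of wilcox.test() still converts the grouping variable to a factor internally (levels in sorted order), so the test results should be unaffected. A small sketch that makes the factor explicit first:

```r
# Recreate the toy data from the question
set.seed(123)
data <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                   y = c(rep("A", 50), rep("B", 50)))

# Convert the grouping variable to a factor so the level order is visible;
# factor() sorts the distinct values, so "A" becomes the first level
data$y <- factor(data$y)
levels(data$y)
```

The first level is the group that "greater" and "less" refer to.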

We can recode it to make "B" the first level:

> data$y <- factor(data$y, levels=c("B", "A"))

Now we have:

> levels(data$y)
[1] "B" "A"

Note that we did not change the data themselves, just the way the categorical variable is encoded “under the hood”:

> head(data)
          x y
1 0.4395244 A
2 0.7698225 A
3 2.5587083 A
4 1.0705084 A
5 1.1292877 A
6 2.7150650 A

> aggregate(data$x, by=list(data$y), mean)
  Group.1        x
1       B 5.292817
2       A 1.034404

But the directions of the test are now inverted:

> wilcox.test(x ~ y, data=data, alternative="greater")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 2448, p-value < 2.2e-16
alternative hypothesis: true location shift is greater than 0

The W statistic is different, but the p-value is the same as for the alternative="less" test with the categories in the original order. With the original data, the alternative could be read as “the location shift from B to A is less than 0”; with the recoded data it becomes “the location shift from A to B is greater than 0”. These are really the same hypothesis (but see Glen_b's comments on the question for the correct interpretation).
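
To see why the two outputs agree, it may help to recall that W counts the pairs in which a value from the first group exceeds a value from the second, so the two statistics above are complementary (52 + 2448 = 2500 = 50 × 50). A rough sketch with the question's toy data split into two plain vectors; the names x_a, x_b, w_a and w_b are introduced here only for illustration:

```r
# Recreate the toy data from the question as two plain vectors
set.seed(123)
x_a <- rnorm(50, 1, 1)
x_b <- rnorm(50, 5, 2)

# W with A as the first sample, and W with B as the first sample
w_a <- unname(wilcox.test(x_a, x_b)$statistic)
w_b <- unname(wilcox.test(x_b, x_a)$statistic)

# The two statistics sum to n_A * n_B
w_a + w_b == length(x_a) * length(x_b)

# W counts the (A, B) pairs in which the A value is the larger one
# (no ties here, since the data are continuous)
w_a == sum(outer(x_a, x_b, ">"))

# So W / (n_A * n_B) estimates P(A > B), the quantity in Glen_b's comment
w_a / (length(x_a) * length(x_b))
```

Swapping the groups therefore replaces W by n_A * n_B - W, which is why "less" in one coding and "greater" in the other look at the same tail.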

In your case, it therefore seems that the test you want is alternative="less" (or, equivalently, alternative="greater" with the recoded data). Does that help?
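
If you would rather not depend on the default level order, one option is to set the reference group explicitly before calling the test. A minimal sketch, assuming the toy data from the question; relevel() is base R, while fit_formula and fit_vectors are names used here only for illustration:

```r
# Recreate the toy data from the question
set.seed(123)
data <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                   y = c(rep("A", 50), rep("B", 50)))

# Put "B" first, so that "greater" reads as "B tends to be larger than A"
data$y <- relevel(factor(data$y), ref = "B")
fit_formula <- wilcox.test(x ~ y, data = data, alternative = "greater")

# Same test with the two samples passed explicitly (first argument = B)
fit_vectors <- wilcox.test(data$x[data$y == "B"], data$x[data$y == "A"],
                           alternative = "greater")

# Both calls test the same hypothesis, so the p-values should match
c(formula = fit_formula$p.value, vectors = fit_vectors$p.value)
```

Passing the two samples as separate vectors has the advantage that the direction is visible in the call itself.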

Mm, sounds like you're onto something there, Gaël. I'll study your answer and get back, thanks for the help! — Waldir Leoncio, Jul 25 '13 at 20:42
Ok, so I guess "greater" in this case is always in reference to the "first" level, right? Ok, that helps and I think it solves the case. Thanks again! — Waldir Leoncio, Jul 29 '13 at 14:30
I just ran into this precise problem. Thanks for the excellent explanation! — Davy Kavanagh, Sep 11 '13 at 21:58
