Say we have the following data:

set.seed(123)
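# factor() is explicit because, since R 4.0, data.frame() no longer
# converts character columns to factors by default
data <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                   y = factor(c(rep('A', 50), rep('B', 50))))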

Which yields the following boxplot (boxplot(data$x ~ data$y)):

[Image: boxplot of x by group y]

Now let's say I want to test if the two samples have the same location parameters (median and/or mean). In my real case, the data are clearly not normal, so I've decided to run the Wilcoxon-Mann-Whitney test, like this:

wilcox.test(data$x ~ data$y)

However, I would like the alternative hypothesis to be that B, data$y's "second" factor, comes from a distribution with higher position parameters. I've tried setting the alternative parameter to "greater" and "less", but apparently the alternative hypotheses are not what I'm looking for. For example, alternative = "greater" tells me "alternative hypothesis: true location shift is greater than 0"; alternative = "less" tells me "alternative hypothesis: true location shift is less than 0".

How can I tweak the wilcox.test() function in order to have the alternative hypothesis I want (B comes from a distribution with higher position parameters than A)? Or should I just use another test instead?

Waldir Leoncio
  • Think about what "location shift" means. – Roland Jul 25 '13 at 14:04
  • In what sense aren't your data normal? Based on the boxplots (possibly not the best way to decide, but what's there) they certainly look normal enough. Moreover, you *generated* your data with `rnorm()`, so **they have to be normal**. I wonder if you're confused about the nature of the assumption of normality; it may help you to read this thread: [What if residuals are normally distributed but y is not](http://stats.stackexchange.com/questions/12262/). – gung - Reinstate Monica Jul 25 '13 at 14:49
  • I am just expanding on @Roland's point but why do you think there is a problem? It seems to give you exactly what you want. – Gala Jul 25 '13 at 14:52
  • @gung I think the phrase “in my real case” implies that the OP is in fact interested in another data set than the one he created for the question. – Gala Jul 25 '13 at 14:54
  • Hmmm, I must have missed that, @GaëlLaurans. Thanks for pointing that out. – gung - Reinstate Monica Jul 25 '13 at 15:00
  • Sorry, I was just too much in a hurry to think of a way of creating asymmetric mock data that would look more like my real case. Feel free to edit the post and substitute those `rnorm()`s for something less well-behaved. – Waldir Leoncio Jul 25 '13 at 15:07
  • What do you understand to be the difference between "comes from a distribution with higher position parameters" & "true location shift is greater than 0"? – Scortchi - Reinstate Monica Jul 25 '13 at 16:19
  • Honestly, @Scortchi, I understand "true location shift is greater than 0" as E(X|Y = A) <> E(X|Y = B), but not as that difference having a direction. But since the other alternative would be "alternative hypothesis: true location shift is less than 0", I guess R is already implying a direction; I'm just failing to see which direction it is. – Waldir Leoncio Jul 25 '13 at 17:33
  • @Roland, I'm interpreting "location shift" as the difference between E(X|Y = A) and E(X|Y = B), is that correct? – Waldir Leoncio Jul 25 '13 at 17:34
  • The Wilcoxon-Mann-Whitney test is sensitive to more general kinds of difference than a straight location shift; for example, with positive values, it's equally sensitive to a scale shift (taking logs converts the scale shift to a location shift, but the WMW statistic is the same). You can even treat a one-sided alternative as general as $P(X>Y)>\frac{1}{2}$, for example (e.g. see Conover's *Practical Nonparametric Statistics*). – Glen_b Jul 25 '13 at 23:14
  • (ctd)... On the other hand, you said at one point "*I want to test if the two samples come from the same distribution*"; since there are more ways for that to be false than a tendency for one variable to be higher (e.g. a shift in variability with similar locations, or a change in skewness or in peakedness), if you really just want to test for equality of distributions vs inequality of them, you should probably consider a two-sample Kolmogorov-Smirnov test. If you are interested in a 'tends to be greater' alternative, then WMW should be okay. – Glen_b Jul 25 '13 at 23:16
  • Thank you for your valuable input, Glen. I will edit my OP, since what I really want is to compare the distributions' position parameters, not whether they come from the same family of distributions. – Waldir Leoncio Jul 26 '13 at 16:04
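
Glen_b's remark about scale shifts can be checked numerically. The sketch below is only an illustration: `u` and `v` are made-up positive samples (not the data from the question), with `v` having the same shape as `u` but twice the scale. Because the test only uses ranks, taking logs should leave the statistic unchanged. The last line runs the two-sample Kolmogorov-Smirnov test he mentions for the more general "same distribution or not" question.

```r
set.seed(1)
u <- rlnorm(30)        # positive values (hypothetical sample)
v <- 2 * rlnorm(30)    # same shape, scale doubled (hypothetical sample)

# Rank-based statistic on the raw scale and on the log scale
w_raw <- wilcox.test(u, v)$statistic
w_log <- wilcox.test(log(u), log(v))$statistic

# log() is monotone, so the ranks (and hence W) are expected to match
all.equal(w_raw, w_log)

# Two-sample Kolmogorov-Smirnov test: sensitive to any difference
# between the two distributions, not only a tendency to be larger
ks.test(u, v)
```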

1 Answer

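Technically, the reference category and the direction of the test depend on the order of the factor's levels: with a formula, wilcox.test() treats the first level as the first sample and the second level as the second, and "greater" means the first sample is shifted to the right of the second. With your toy data: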

> wilcox.test(x ~ y, data=data, alternative="greater")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 52, p-value = 1
alternative hypothesis: true location shift is greater than 0 

> wilcox.test(x ~ y, data=data, alternative="less")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 52, p-value < 2.2e-16
alternative hypothesis: true location shift is less than 0 

Notice that the W statistic is the same in both cases but the test uses opposite tails of its sampling distribution. Now let's look at the factor variable:

> levels(data$y)
[1] "A" "B"

We can recode it to make "B" the first level:

> data$y <- factor(data$y, levels=c("B", "A"))

Now we have:

> levels(data$y)
[1] "B" "A"
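
A possible alternative to spelling out all the levels is relevel(), which moves one level to the front. This is a minimal sketch on a fresh copy of the toy data; `toy` and `y_b_first` are names invented here so that the `data` object above is not overwritten.

```r
# Fresh copy of the toy data from the question (hypothetical name: toy)
set.seed(123)
toy <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                  y = factor(rep(c("A", "B"), each = 50)))

# relevel() makes the chosen level the first one and keeps the rest in order
y_b_first <- relevel(toy$y, ref = "B")

levels(toy$y)      # original order
levels(y_b_first)  # "B" now comes first

# The group labels attached to each observation are unchanged
all(as.character(y_b_first) == as.character(toy$y))
```

With only two levels, this should give the same ordering as factor(..., levels = c("B", "A")).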

Note that we did not change the data themselves, just the way the categorical variable is encoded “under the hood”:

> head(data)
          x y
1 0.4395244 A
2 0.7698225 A
3 2.5587083 A
4 1.0705084 A
5 1.1292877 A
6 2.7150650 A

> aggregate(data$x, by=list(data$y), mean)
  Group.1        x
1       B 5.292817
2       A 1.034404

But the directions of the test are now inverted:

> wilcox.test(x ~ y, data=data, alternative="greater")

    Wilcoxon rank sum test with continuity correction

data:  x by y 
W = 2448, p-value < 2.2e-16
alternative hypothesis: true location shift is greater than 0 

The W statistic is different, but the p-value is the same as for the alternative="less" test with the categories in the original order. The two statistics are complementary: 52 + 2448 = 2500 = 50 × 50, the number of pairs made of one A and one B observation. With the original data, the hypothesis could be read as “the location shift from B to A is less than 0”, and with the recoded data it becomes “the location shift from A to B is greater than 0”; these are really the same hypothesis (but see Glen_b's comments on the question for the correct interpretation).
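
To check that the two calls really are equivalent, one could run something like the following. It is a sketch on a fresh copy of the toy data; `toy`, `toy_rev`, `a_first`, and `b_first` are names made up for this illustration.

```r
set.seed(123)
toy <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                  y = factor(rep(c("A", "B"), each = 50)))

# Same observations, but with "B" as the first level
toy_rev <- transform(toy, y = factor(y, levels = c("B", "A")))

a_first <- wilcox.test(x ~ y, data = toy,     alternative = "less")
b_first <- wilcox.test(x ~ y, data = toy_rev, alternative = "greater")

# The two W statistics add up to n1 * n2 (here 50 * 50)
unname(a_first$statistic + b_first$statistic) == 50 * 50

# Both calls test the same hypothesis, so the p-values should agree
all.equal(a_first$p.value, b_first$p.value)
```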

In your case, it therefore seems that the test you want is alternative="less" (or, equivalently, alternative="greater" with the recoded data). Does that help?
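
If relying on the level order feels fragile, one option is to pass the two samples as separate vectors, so the direction is spelled out in the call itself. This is only a sketch, not the only way to do it; `toy`, `x_a`, `x_b`, and `res` are names invented for the example. With two vectors, alternative="greater" means the first argument is shifted to the right of the second.

```r
set.seed(123)
toy <- data.frame(x = c(rnorm(50, 1, 1), rnorm(50, 5, 2)),
                  y = factor(rep(c("A", "B"), each = 50)))

# Split the response by group so each sample is named explicitly
x_a <- toy$x[toy$y == "A"]
x_b <- toy$x[toy$y == "B"]

# Alternative hypothesis: B is shifted to the right of A.
# conf.int = TRUE also returns an estimate of the location difference
# (first sample minus second, i.e. B minus A here).
res <- wilcox.test(x_b, x_a, alternative = "greater", conf.int = TRUE)

res$p.value   # small if B tends to be larger than A
res$estimate  # positive if B sits to the right of A
```

The sign of the estimate is a quick sanity check that the test is pointing in the intended direction.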

Gala