
I am using the Wilcoxon rank-sum/Mann-Whitney U test to compare metrics between two groups. In almost all cases I get huge values for the statistic and p-values around zero. In one case I got a reasonable p-value of about 0.22, which raises the question of whether the other cases are actually around zero. The number of values in the distributions I compare ranges between 200k and 800k depending on the metric, and can differ between the two groups.

The null hypothesis is that both groups are from the same distribution.

  • Is it possible to get p-values around zero with these tests and what does that mean?
  • Is there a cause which could explain this behavior?

Please comment if you need more information on my setup.

Edit 1:

The statistic values seem reasonable, as they are close to the expected mean of half the product of the two group sizes.
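This expectation is easy to check in simulation. A minimal sketch (hypothetical group sizes in the range described above, scipy's `mannwhitneyu` assumed): when both groups are drawn from the same distribution, $U/(mn)$ should sit very close to $1/2$.

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical group sizes in the range described in the question.
m, n = 300_000, 500_000

# Under the null, both groups come from the same distribution.
x = rng.normal(size=m)
y = rng.normal(size=n)

u, p = mannwhitneyu(x, y, alternative="two-sided")

# U should be close to m*n/2, i.e. U/(m*n) close to 0.5.
print(u / (m * n))
```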

Edit 2:

It seems that I have a few very high outliers, but they should not be an issue with the chosen test, as described here: Do we need to worry about outliers when using rank-based tests?.

  • Please can you further motivate your problem by e.g. plotting the two distributions of responses that form your metrics, one for each group? The test statistic for M-W has a mean of around 1/2 the product of sample sizes in both groups, and if the distributions are sufficiently different, the p-value will be small. – B.Liu Aug 22 '21 at 10:33
  • Your null hypothesis is incorrect: the rank-sum test has $\text{H}_{0}\text{: }P(X_{A} > X_{B}) = 0.5$ with $\text{H}_{\text{A}}\text{: }P(X_{A} > X_{B}) \ne 0.5$. If you can support the additional, more stringent assumptions that (1) the distributions have the same shape, and (2) have the same dispersion, then you can interpret the null as mean equality and as median equality, with the test providing evidence of a location shift. Wilcoxon's [original article](https://link.springer.com/chapter/10.1007/978-1-4612-4380-9_16) bears a read. :) – Alexis Aug 22 '21 at 16:45
  • @Alexis Interesting, honestly I got the hypothesis from the scipy docs and didn't think much about it (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html). – marvinsxtr Aug 22 '21 at 16:59
  • @Alexis I read [this article](https://statistics.laerd.com/statistical-guides/mann-whitney-u-test-assumptions.php) on the topic, suggesting that the null hypothesis remains the same in both cases (different or same shape, i.e. detecting equal distributions or changes in median) and only the alternative hypothesis changes. Did I miss something here? – marvinsxtr Aug 22 '21 at 17:54
  • @marvinsxtr You did: the null hypothesis is, as Wilcoxon, and Mann and Whitney wrote, a test for evidence of whether one group is "stochastically larger" than the other, exactly as I expressed in my first comment. If the distributions have different shapes or different dispersions, then it is possible to correctly reject the null I wrote above *without* a corresponding difference in means, or in medians, and vice versa. See [Mann & Whitney](https://projecteuclid.org/download/pdf_1/euclid.aoms/1177730491) also. – Alexis Aug 22 '21 at 19:19
  • Thank you for the clarification :) – marvinsxtr Aug 22 '21 at 21:01

1 Answer


Nothing out of the ordinary is going on from the sound of it.

In almost all cases, I get huge values for the statistic

Have you looked at the range of possible values for the statistic?

For the usual form of the U-statistic, it can take values between $0$ and $mn$ where $m$ and $n$ are the two sample sizes.

If you divide the statistic by $mn$, you get $\frac{U}{mn}$, which is the proportion (rather than the count) of cases in which a value from one sample exceeds a value from the other, which takes values between $0$ and $1$. The null case corresponds to an expected proportion of $\frac12$ (with standard error $\sqrt{\frac{m+n+1}{12mn}}$).

Alternatively, you could look at a z-score, which you may find somewhat more intuitive than the raw test statistic.
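The rescaled statistic and its z-score are both one-liners given $U$, $m$, and $n$. A minimal sketch (hypothetical sample sizes and a made-up location shift; scipy's `mannwhitneyu` assumed):

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
m, n = 2_000, 3_000  # hypothetical sample sizes

# Two groups with a modest location shift.
x = rng.normal(loc=0.0, size=m)
y = rng.normal(loc=0.1, size=n)

u, p = mannwhitneyu(x, y, alternative="two-sided")

# Proportion form: fraction of (x_i, y_j) pairs with x_i > y_j
# (ties counted as 1/2 in the statistic's definition).
prop = u / (m * n)

# Null mean 1/2 and standard error of the proportion, as given above.
se = np.sqrt((m + n + 1) / (12 * m * n))

# z-score: how many null standard errors the proportion is from 1/2.
z = (prop - 0.5) / se

print(f"U = {u:.0f}, U/(mn) = {prop:.4f}, z = {z:.2f}, p = {p:.3g}")
```

The proportion and z-score carry the same information as the raw $U$, but are on scales that do not grow with the sample sizes, which makes "huge" statistic values easier to interpret.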

Is it possible to get p-values around zero with these tests and what does that mean?

Certainly. Unless your sample sizes are very small, extremely small p-values are possible.

For a one-tailed test the p-value may be as small as $\frac{m!\, n!}{(m+n)!}$ and twice that for a two-tailed test. For example with small sample sizes of $m=n=10$, you could see a two-tailed p-value as small as $1/92378$ or about $0.000011$ and the smallest available p-values decrease very rapidly as sample sizes increase. Doubling both sample sizes to $m=n=20$ reduces the smallest possible p-value by a factor of about $746000$, to $1.45\times 10^{-11}$.
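These bounds are straightforward to compute, since $\frac{m!\, n!}{(m+n)!} = 1/\binom{m+n}{m}$. A short sketch (a hypothetical helper, using only the standard library):

```python
from math import comb

def min_p_two_tailed(m, n):
    # Smallest attainable one-tailed p-value is m!n!/(m+n)! = 1/C(m+n, m),
    # the null probability of the single most extreme ranking;
    # the two-tailed bound is twice that.
    return 2 / comb(m + n, m)

print(min_p_two_tailed(10, 10))  # 2/184756, about 0.000011
print(min_p_two_tailed(20, 20))  # about 1.45e-11
```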

Is there a cause which could explain this behavior?

A small to moderate effect size with large samples or a large effect size with smaller samples can both do it.
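To see the first case in action, here is a sketch (made-up shift and sizes) in which a location shift far too small to detect with 100 observations per group produces a near-zero p-value at sample sizes like those in the question:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(42)

# A small location shift...
m = n = 200_000  # sample sizes like those in the question
x = rng.normal(loc=0.00, size=m)
y = rng.normal(loc=0.02, size=n)

# ...is highly significant with large samples...
_, p_large = mannwhitneyu(x, y, alternative="two-sided")

# ...but invisible with small ones.
_, p_small = mannwhitneyu(x[:100], y[:100], alternative="two-sided")

print(p_large, p_small)
```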

Glen_b
  • It's salutary to phrase WMW as estimating the probability that values in one group are greater than values in the other. It's common that this is close to 0.5 (whether slightly above or slightly below depends on labelling only). It's an unfortunate detour lasting 70+ years to put more emphasis on the test than on an estimate of this quantity. – Nick Cox Aug 22 '21 at 12:17
  • I wholeheartedly agree -- indeed I just finished off an incomplete edit from shortly after my initial posting adding a little on this view of the statistic. – Glen_b Aug 22 '21 at 16:22
  • Thank you for your very detailed answer, this definitely helped me out. Can you provide a source for the proportion interpretation of the test and how the error was calculated? – marvinsxtr Aug 22 '21 at 17:21
  • Birnbaum, Z.W. 1956. On a use of the Mann-Whitney statistic. _Third Berkeley Symposium on Mathematical Statistics and Probability_ 1: 13-17. Wolfe, Douglas A., and Robert V. Hogg. 1971. On constructing statistics and reporting data. _The American Statistician_ 25(4): 27-30. doi:10.2307/2682922. – Nick Cox Aug 22 '21 at 20:51
  • @NickCox Thanks for that. There's also the immediate fact that U is a count of 'successes' divided by a total number of attempts (it's not binomial, however) that invites a direct 'sample proportion estimating a population proportion' interpretation. Mann and Whitney's original paper talks directly about the probability of an observation from one sample being smaller than an observation from the other under the null (which they identify as being $\frac12$). Indeed, this proportion/probability identification arguably goes all the way ... ctd – Glen_b Aug 22 '21 at 22:48
  • ctd... back to Deuchler (1914); he construes the test as a set of pairwise comparisons between groups, and writes the statistic as the proportion of (A>B) minus the proportion of (A<B). – Glen_b Aug 22 '21 at 22:54
  • Returning to Mann and Whitney, when considering consistency in section 5 they explicitly write $P(x_i>y_j)$ in discussing the expectation of the statistic under the alternative, and then identify $E(U)$ as being $mn$ times this probability, constructing a binary indicator variable for this event ($x_i>y_j$). They seem to be aware of the interpretation. – Glen_b Aug 22 '21 at 22:59
  • Then there's recognizing that the Mann-Whitney U is $mn$ times a [*U-statistic*](https://en.wikipedia.org/wiki/U-statistic) (for which Hoeffding, 1948 is the main reference, but Lehmann, 1951 ('Consistency and Unbiasedness of Certain Nonparametric Tests,' *The Annals of Mathematical Statistics*, 22(2), 165-179) discusses the case of two-sample statistics and covers the Mann-Whitney statistic, where the connection to P(Y>X) is explicit). However, all this aside, NickCox's reference will probably be the one you want. – Glen_b Aug 22 '21 at 23:51
  • @marvinsxtr re the standard error: $\text{s.e.}(\frac{1}{mn} U) = \frac{1}{mn}\text{s.e.}( U)$ (this follows from [basic properties of variance](https://en.wikipedia.org/wiki/Variance#Basic_properties) - $\text{Var}(aX) = a^2 \text{Var}(X)$). – Glen_b Aug 23 '21 at 00:55