Nearly 0 p-value with Welch's t-test and near 1 with Mann Whitney U test

Question

I have carried out some t-tests on various groups in my data set. The response variable is the number of occurrences of an event in a day; for most observations, there are no occurrences, so the distribution is zero-inflated. In the tests, the alternative hypothesis is that those lacking the characteristic ("No" group) will have a lower average of occurrences in a day.

Mean: 1.917101. Variance: 24.92175

Mean: 2.095268. Variance: 12.80092

Because of the very highly skewed data, and despite the large sample sizes (Yes Group ~ 3700, No Group ~ 8000), I thought I'd carry out MWW tests for comparison. The MWW results for most groups are consistent with the t-test, but for the ones in the histograms above, I get a p-value of 0.014 for the t-test and 1 for the MWW.

The data is a bit unwieldy so I've tried to replicate it below with at least equally flummoxing results for the iris dataset.

```r
library(dplyr)
library(purrr)

iris %>% 
  mutate(Species = if_else(Species == "versicolor", "Yes", "No")) %>% 
  summarise(w = list(wilcox.test(Petal.Length ~ Species, alternative = "less")),
            t = list(t.test(Petal.Length ~ Species, alternative = "less"))) %>% 
  mutate(t.pval = map_dbl(t, "p.value"),
         mww.pval = map_dbl(w, "p.value")) %>% 
  select(-w, -t)

#>         t.pval  mww.pval
#> 1 0.0004227251 0.4303032
```

What is driving this? I understand that the null hypotheses in both tests are not the same (equality in means vs equal probability that a randomly selected value from group No will be lower or higher than from group Yes); but I'm not sure how that explains the widely diverging results?

Any pointers would be appreciated!

related: https://stats.stackexchange.com/questions/416417/kruskal-wallis-and-negative-binomial-regression-do-not-agree/416438#416438 maybe an analogous picture would help explain? — Ben Bolker, Jul 26 '19 at 13:38
MWW is based on ranks. You may have so many 0's that medians of both groups are 0. Data are not particularly well suited for either MWW or 2-sample t (read about assumptions). Depending on purpose (not exactly clear here) maybe a permutation test with an appropriate metric would work better. Or maybe chi-squared test on $2 \times 2$ table tabulating counts of 0's and non-0's for Yes and No groups. Maybe look [here](https://stats.stackexchange.com/questions/416340/under-what-circumstances-does-mann-whitney-and-wilcoxon-signed-rank-test-fail/416388#416388). — BruceET, Jul 26 '19 at 15:11
Thank you @Ben & @Bruce! Goal is to compare the mean hours without supply of electricity among different Indian groups. I was just looking for something simpler than a [zero-inflated negative binomial](https://www.physiology.org/doi/full/10.1152/advan.00017.2010) model. If I understand Ben correctly, rank-based methods (or at least MWW) lose power to distinguish groups if the range is small, which affects both my count data and the limited range in `iris`. But how come both MWW and t-tests are considered equivalent for [analysing Likert-scale data](https://pareonline.net/getvn.asp?v=15&n=11)? — Fons MA, Jul 26 '19 at 23:56

Glen_b · Answer 1 · 2020-05-11T17:48:14.613

Certainly we can consider what's happening in the iris data. I will focus on two-sided tests rather than one-sided because the explanation is a little simpler to give, but the reasoning is similar.

In the case of the iris data, and treating the Wilcoxon-Mann-Whitney test as having a location-shift alternative (which is not required), the two tests agree fairly well about what the location shift is; in fact, the WMW estimate of it is a bit larger than that of the Welch t-test (or indeed the ordinary t-test).

The big difference for the iris data is in how they see the uncertainty in that estimate of the difference.

In particular, note that one of the two groups you created is distinctly bimodal; this naturally inflates the estimated variance of the difference in means. In the case of the Wilcoxon-Mann-Whitney test, however, the quantity of interest (for both sample and population) is the median of cross-sample pairwise differences. Because of the bimodality in one of the groups, the pairwise differences are also bimodal, in such a way that the median falls in a region of low density.

(It's necessary to emphasize at this stage that we're looking not at a difference of medians but at a median of differences, and those two things are not the same.)

Indeed, if I did everything right, the sample median of the cross-sample pairwise differences is in the big gap between those two humps.
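For anyone who wants to check this on the iris groups from the question, here is a minimal sketch in base R. The names `yes`, `no` and `d` are mine, and I take the differences as No minus Yes, which should match the direction the formula interface uses when the group labels sort as "No", "Yes".

```r
# Recreate the two groups from the question:
# versicolor is "Yes", the other two species together are "No"
yes <- iris$Petal.Length[iris$Species == "versicolor"]
no  <- iris$Petal.Length[iris$Species != "versicolor"]

# All cross-sample pairwise differences (No minus Yes): one row per "No"
# observation, one column per "Yes" observation
d <- outer(no, yes, "-")

# Median of the pairwise differences, next to the difference in means
med_diff  <- median(d)
mean_diff <- mean(no) - mean(yes)
print(c(median_of_differences = med_diff, difference_in_means = mean_diff))

# Histogram of the pairwise differences, with the median marked
hist(d, breaks = 40, xlab = "No - Yes",
     main = "Cross-sample pairwise differences")
abline(v = med_diff, lty = 2)
```

The dashed line marks where the median sits relative to the two humps. Note that the mean of `d` is exactly the difference in means, whereas the median of `d` is generally not the difference of the two group medians.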

If we knew the density, asymptotically the standard error of the median is proportional to the reciprocal of the density at the median in the population. Clearly that reciprocal would be large.
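For reference, the textbook version of that statement, for a sample median $\hat{m}$ of $n$ independent observations from a density $f$ with population median $m$, is below. The cross-sample pairwise differences are not independent of each other, so treat this as a heuristic for the present case rather than an exact result.

$$
\operatorname{SE}(\hat{m}) \approx \frac{1}{2\, f(m)\, \sqrt{n}}
$$

A small $f(m)$ therefore means a large standard error, whatever the sample size.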

Conversely, to construct a nonparametric CI, where we have no assumed density shape, the endpoints of an interval for that population median would be two quantiles (symmetrically placed, in the sense that they'd have the same proportion of data beyond them, ties permitting). Which quantiles are involved depends on the sample size. However, because of the substantial gap between the modes, even in a fairly large sample these quantiles are a fair way apart. Again we see that low density in the region of the sample median will tend to make an interval for the population median wide.

This means that a CI for the median of pairwise differences will be wide. In this case the width of a 95% (two-sided) interval is 3, and it extends some way past 0. The interval for the difference in means from the Welch test is less than half as wide in this case, more than compensating for the difference in means being somewhat closer to 0.
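The two intervals can be put side by side with something like the following sketch. It uses only base R's `wilcox.test()` (with `conf.int = TRUE`, which also returns the estimated location shift) and `t.test()` (Welch by default). The object names are mine, and the tests are two-sided, as in the rest of this answer.

```r
# Same groups as in the question: versicolor ("Yes") vs the rest ("No")
yes <- iris$Petal.Length[iris$Species == "versicolor"]
no  <- iris$Petal.Length[iris$Species != "versicolor"]

# Two-sided tests; x = No, y = Yes, so shifts are No minus Yes
w  <- wilcox.test(no, yes, conf.int = TRUE)  # WMW with shift estimate and CI
tt <- t.test(no, yes)                        # Welch t-test (var.equal = FALSE)

# Point estimates of the shift from each procedure
estimates <- c(wmw   = unname(w$estimate),
               welch = unname(tt$estimate[1] - tt$estimate[2]))

# Widths of the two 95% intervals
widths <- c(wmw = diff(w$conf.int), welch = diff(tt$conf.int))

print(estimates)
print(rbind(wmw = w$conf.int, welch = tt$conf.int))
print(widths)
```

Comparing `widths` shows how differently the two procedures see the uncertainty, even though `estimates` are fairly close.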

For your original pair of samples, it doesn't look to me like the explanation will be the same as for the iris data.
