Under what circumstances do the Mann-Whitney and Wilcoxon signed-rank tests fail?

Question

I read from here that

The advice must be modified somewhat when the distributions are both strongly skewed and very discrete, such as Likert scale items where most of the observations are in one of the end categories. Then the Wilcoxon-Mann-Whitney isn’t necessarily a better choice than the t-test.

I'm trying to use the Mann-Whitney test as a non-parametric alternative to the t-test when distributions are not normal. But it seems that there are cases in which even Mann-Whitney isn't necessarily a better choice than a parametric method.

Under what circumstances does Mann-Whitney test fail, and what are the alternatives that I have in such cases?

Please explain for both independent and dependent samples (Mann-Whitney and Wilcoxon signed-rank).

Not being a "better choice" is not tantamount to "failure." Statistical procedures are selected for their abilities to yield good decisions either on average or in the worst cases. The use of a procedure that is "not a better choice" can still be indicated from other considerations such as simplicity, interpretability, robustness, and so forth. — whuber, Jul 07 '19 at 13:16
That quote sounds like me (and the advice comes from results in one of the references I wrote about). Can you clarify the sense in which you intend "fails"? — Glen_b, Nov 03 '20 at 10:31

BruceET · Answer 1 · 2019-07-07T23:26:02.503

Traditionally, these rank-based tests were not recommended for use when there are many ties. However, implementations of these tests in some statistical software compute useful approximate P-values for data containing ties, often with a warning that these P-values are not exact.

Challenger Data. Data presented to the Presidential Commission that investigated the 1986 explosion of the space shuttle Challenger showed the following numbers of partial (non-catastrophic) O-ring failures on 24 previous shuttle launches, grouped by launch temperature below (cold) or above (warm) 65 degrees Fahrenheit:

cold:  1 1 1 3
warm:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2

Permutation test: In their textbook Statistical Sleuth, Ramsey and Schafer report the exact P-value 0.00988 for a one-sided permutation test using the pooled t statistic as metric. (Pages 82 and 91.) This exact P-value can be computed by moderately tedious combinatorial methods.
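
For a data set this small, the combinatorial work can also be left to the computer. The following is a minimal sketch, not the textbook's method: it enumerates all choose(24, 4) = 10626 ways of assigning four launches to the cold group. It relies on the fact that, with the pooled data held fixed, the pooled t statistic is an increasing function of the cold-group sum, so the sum can stand in as the test statistic. Variable names are illustrative.

```r
# O-ring incident counts per launch (from the data listed above)
cold <- c(1, 1, 1, 3)
warm <- c(rep(0, 17), 1, 1, 2)
x <- c(cold, warm)

# Sum of the "cold" group for every possible choice of 4 launches out of 24
perm.sums <- combn(x, length(cold), FUN = sum)

# One-sided exact P-value: share of assignments whose sum is at least
# as large as the observed cold-group sum
p.exact <- mean(perm.sums >= sum(cold))
length(perm.sums)   # number of assignments enumerated
p.exact
```

With these data the enumeration gives 105/10626, about 0.00988, in agreement with the value quoted from the textbook.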

A very good approximate P-value 0.01 is found by a simulation in R:

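cold = c(1, 1, 1, 3); warm = c(rep(0, 17), 1, 1, 2)   # data listed above
x = c(cold, warm); g = c(rep(1,4), rep(2,20))          # pooled values; 1 = cold, 2 = warm
t.obs = t.test(x ~ g, alt="g", var.eq=T)$stat          # observed pooled t statistic
set.seed(707)
# recompute t for 10^5 random reassignments of the group labels
t.prm = replicate(10^5, t.test(x ~ sample(g), alt="g", var.eq=T)$stat)
mean(t.prm >= t.obs)                                   # simulated one-sided P-value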
[1] 0.01009

Wilcoxon RS: The P-value 0.0006 results from a one-sided Wilcoxon rank sum test, as implemented in R:

wilcox.test(cold, warm, alt="g")$p.val
[1] 0.0005720256
Warning message:
In wilcox.test.default(cold, warm, alt = "g") :
  cannot compute exact p-value with ties
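
The warning arises because the exact null distribution that R tabulates assumes no ties, so wilcox.test falls back on a normal approximation. As a rough check on that approximation (a sketch using brute-force enumeration, not something wilcox.test itself does), the permutation distribution of the rank sum can be computed directly from the midranks:

```r
cold <- c(1, 1, 1, 3)
warm <- c(rep(0, 17), 1, 1, 2)

r <- rank(c(cold, warm))            # midranks: tied values share an average rank
w.obs <- sum(r[seq_along(cold)])    # observed rank sum of the cold group

# Rank sum of the "cold" group under every possible assignment of 4 launches
w.all <- combn(r, length(cold), FUN = sum)
p.perm <- mean(w.all >= w.obs)      # one-sided permutation P-value
p.perm
```

For these data the enumeration gives 20/10626, about 0.0019, which is roughly three times the normal-approximation value, though still small.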

Welch t test: P-value 0.038 results from a one-sided Welch t test.

t.test(cold, warm, alt="g")$p.val
[1] 0.0384483

Fisher exact test: A one-sided Fisher exact test (based on a hypergeometric model) looking at categories 'No Failures' and 'At least One Failure' gives P-value 0.003. Out of 17 failure-free launches, none were among the four in cold weather.

phyper(0, 17, 7,  4)
[1] 0.003293808
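
The same P-value can be obtained from fisher.test applied to the 2-by-2 table of temperature group by failure status. This is only a sketch; the table layout and dimension names are illustrative choices:

```r
# Rows: temperature group; columns: at least one O-ring failure vs. none
tab <- matrix(c(4, 3, 0, 17), nrow = 2,
              dimnames = list(temp = c("cold", "warm"),
                              failure = c("yes", "no")))

# One-sided test: are failures more likely in the cold group?
p.fisher <- fisher.test(tab, alternative = "greater")$p.value
p.fisher

# Agrees with the direct hypergeometric computation
all.equal(p.fisher, phyper(0, 17, 7, 4))
```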

Which test is 'best' here?

Assurances of well-approximated P-values notwithstanding, I would wonder whether to use the Wilcoxon test in the face of so very many ties.
Legendary robustness or not, I would wonder about the accuracy of the P-value from the Welch t test.
The permutation test and Fisher's exact test seem to rest on more solid ground. (Although the Fisher test may lose some power by reducing results to two categories.)

Note: The Commission concluded that O-rings used in the shuttles were not sufficiently pliable at cooler temperatures to provide a safe fuel seal between sections of booster rockets. Google 'Challenger commission' or see Feynman, R. P. (1988): "What Do You Care What Other People Think?", Norton.

Sorry about that. Maybe I missed the point of your question. Maybe you can explain what you mean by "Under what circumstances does Mann-Whitney test fail, and what are the alternatives that I have in such cases?" — BruceET, Jul 08 '19 at 04:14

score 1 · Answer 2 · answered Jul 18 '21 at 11:59

Considering only the Wilcoxon-Mann-Whitney two-sample test and not the Wilcoxon signed-rank test (which assumes symmetry of the distribution), the Wilcoxon two-sample test is a special case of the proportional odds semiparametric ordinal logistic regression model. When you use this model, not only can you adjust for other variables, but you can also handle arbitrarily many ties in Y, all the way to the extreme case where Y has only two values (the binary logistic model). Thinking about this from a modeling perspective also exposes the assumption needed for the Wilcoxon test to be optimal: the logit of the cumulative distribution function for Y, stratified by group, results in two parallel curves. These need not be linear as in parametric models. This is equivalent to saying that there is a function of Y such that the transformed distributions are shifted by a constant, and the transformed difference has a logistic distribution.
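
As a hedged illustration of this connection, here is a sketch on simulated (hypothetical) data. It uses polr from the MASS package as one available proportional odds fitter; the answer itself does not prescribe a particular function. The group effect in the model and the Wilcoxon test are computed side by side:

```r
set.seed(1)
n <- 100

# Simulated, heavily tied ordinal outcome with four levels (hypothetical data)
y.a <- sample(0:3, n, replace = TRUE, prob = c(0.6, 0.2, 0.1, 0.1))
y.b <- sample(0:3, n, replace = TRUE, prob = c(0.1, 0.2, 0.3, 0.4))
d <- data.frame(y = factor(c(y.a, y.b), levels = 0:3, ordered = TRUE),
                g = factor(rep(c("A", "B"), each = n)))

# Proportional odds model: the coefficient of g is the log odds ratio,
# group B versus group A, of falling in a higher category
fit <- MASS::polr(y ~ g, data = d, Hess = TRUE)
coef(fit)

# Wilcoxon-Mann-Whitney test on the same data (normal approximation; ties present)
p.wmw <- wilcox.test(y.b, y.a)$p.value
p.wmw
```

Further covariates can be added to the right-hand side of the model formula, which is the kind of adjustment mentioned above.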

Examples are given in the nonparametric statistics chapter of BBR. Models unify tests.
