Traditionally, these rank-based tests were not recommended
for data with many ties. However, implementations of them in some
statistical software compute useful approximate P-values for data containing ties, often with a warning that these P-values are not exact.
Challenger Data. Data presented to the Presidential Commission that investigated the 1986 explosion of the space shuttle Challenger showed the numbers of partial (non-catastrophic) O-ring failures on 24
previous shuttle launches. Grouped by launch temperature below ('cold') or above ('warm') 65 degrees Fahrenheit, the counts were as follows:
cold: 1 1 1 3
warm: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 2
Permutation test: In their textbook The Statistical Sleuth (pages 82 and 91), Ramsey and Schafer report the exact
P-value 0.00988 for a one-sided permutation test using the pooled t statistic as metric. This exact P-value can be computed by moderately tedious combinatorial methods.
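The combinatorial computation can also be done by brute force, enumerating all choose(24, 4) = 10626 ways of labeling four of the 24 launches as 'cold'. The sketch below is not necessarily how Ramsey and Schafer obtained their value. It relies on the fact that, for a fixed pooled sample, the pooled t statistic is an increasing function of the cold-group total, so the totals are compared directly (which also avoids floating-point ties among t values). The names sums and p.exact are introduced here for illustration.

```r
cold <- c(1, 1, 1, 3)
warm <- c(rep(0, 17), 1, 1, 2)
x <- c(cold, warm)                      # pooled data, 24 launches

# total failures in the 'cold' group for every possible choice of 4 launches
sums <- combn(x, length(cold), FUN = sum)
length(sums)                            # choose(24, 4) = 10626

# proportion of relabelings with a cold total at least as large as observed
p.exact <- mean(sums >= sum(cold))
p.exact                                 # 105/10626, about 0.00988
```

Only 105 of the 10626 relabelings give a cold-group total of 6 or more, which reproduces the textbook value.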
A very good approximate P-value 0.01 is found by a simulation in R:
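cold = c(1, 1, 1, 3);  warm = c(rep(0, 17), 1, 1, 2)   # data listed above
x = c(cold, warm); g = c(rep(1,4), rep(2,20))          # pooled data and group labels
# observed pooled t statistic (one-sided: cold greater than warm)
t.obs = t.test(x ~ g, alt="g", var.eq=TRUE)$stat
set.seed(707)
# pooled t statistic for 10^5 random relabelings of the 24 launches
t.prm = replicate(10^5, t.test(x ~ sample(g), alt="g", var.eq=TRUE)$stat)
mean(t.prm >= t.obs)    # proportion at least as extreme as observed
[1] 0.01009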
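How close a simulated P-value of this kind is likely to be to the exact one can be gauged from its Monte Carlo standard error. A rough sketch, treating the 10^5 random permutations as independent Bernoulli trials (the names p.hat, B, se, and ci are introduced here for illustration):

```r
p.hat <- 0.01009                      # simulated P-value reported above
B <- 10^5                             # number of random permutations
se <- sqrt(p.hat * (1 - p.hat) / B)   # binomial standard error
se                                    # roughly 0.0003
ci <- p.hat + c(-2, 2) * se           # rough 95% interval for the exact P-value
ci
```

The resulting interval, roughly 0.0095 to 0.0107, contains the exact value 0.00988.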
Wilcoxon rank sum test: A one-sided Wilcoxon rank sum test, as implemented in R, gives the P-value 0.0006:
wilcox.test(cold, warm, alt="g")$p.val
[1] 0.0005720256
Warning message:
In wilcox.test.default(cold, warm, alt = "g") :
cannot compute exact p-value with ties
Welch t test: A one-sided Welch t test gives the P-value 0.038:
t.test(cold, warm, alt="g")$p.val
[1] 0.0384483
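Fisher exact test: A one-sided Fisher exact test (based on a hypergeometric model) looking at categories 'No Failures' and 'At least One Failure' gives P-value 0.003.
Out of 17 failure-free launches, none were among the four in cold weather. The P-value is the hypergeometric probability that four launches chosen at random from the 24 (17 failure-free, 7 with at least one failure) include no failure-free launch: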
phyper(0, 17, 7, 4)
[1] 0.003293808
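The same P-value can be cross-checked with R's built-in fisher.test applied to the 2 x 2 table of launches. A minimal sketch; the object names tab and p.fisher and the row and column labels are chosen here for illustration:

```r
# rows: temperature group; columns: at least one failure vs. no failures
tab <- matrix(c(4, 3, 0, 17), nrow = 2,
              dimnames = list(c("cold", "warm"), c("failure", "none")))
tab

# one-sided test: odds of failure greater in cold launches than in warm ones
p.fisher <- fisher.test(tab, alternative = "greater")$p.value
p.fisher

# compare with the direct hypergeometric computation
all.equal(p.fisher, phyper(0, 17, 7, 4))
```

Both computations reduce to choose(7, 4)/choose(24, 4) = 35/10626.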
Which test is 'best' here?
- Assurances of well-approximated P-values notwithstanding, I would wonder whether to use the Wilcoxon test in the face of so very many ties.
- Legendary robustness or not, I would wonder about the accuracy of the P-value from the Welch t test.
- The permutation test and Fisher's exact test seem to rest on more solid ground, although the Fisher test may lose some power by reducing the results to two categories.
Note: The Commission concluded that O-rings used in the shuttles were not sufficiently pliable at cooler temperatures to provide a safe fuel seal between sections of booster rockets. Google 'Challenger commission' or see Feynman, R. P. (1988): "What Do You Care What Other People Think?", Norton.