11

I have these groups where the values are responses to a 10-point Likert item:

g1 <- c(10,9,10,9,10,8,9)
g2 <- c(4,9,4,9,8,8,8)
g3 <- c(9,7,9,4,8,9,10)

I therefore used the Kruskal-Wallis test to check for any differences between the groups' responses, and the result was:

Kruskal-Wallis chi-squared = 5.9554, df = 2, p-value = 0.05091

However, if I run an exact Mann-Whitney test between groups g1 and g2 I get:

Exact Wilcoxon Mann-Whitney Rank Sum Test (using coin::wilcox_test)
Z = 2.3939, p-value = 0.02797

which returns a significant difference at alpha = 0.05.
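
For reference, here is a minimal sketch of R calls that could produce output like the above (the exact calls are my assumption; only coin::wilcox_test is named in the output), using g1, g2, g3 as defined above:

# Overall Kruskal-Wallis test across the three groups
kruskal.test(list(g1, g2, g3))

# Exact Mann-Whitney / Wilcoxon rank-sum test between g1 and g2 (coin package)
library(coin)
d12 <- data.frame(response = c(g1, g2),
                  group = factor(rep(c("g1", "g2"), each = 7)))
wilcox_test(response ~ group, data = d12, distribution = "exact")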

Which test should I choose, and why?

Ferdi
mljrg
  • For some laughs and on the topic of black-and-white cut-offs: https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/ – Hank Jul 15 '17 at 11:58

4 Answers

14

The Mann-Whitney (Wilcoxon rank-sum) test compares two groups, while the Kruskal-Wallis test compares three or more. Just as in an ordinary ANOVA with three or more groups, the generally suggested procedure is to do the overall F test first and then look at pairwise comparisons if there is a significant difference; I would do the same here with the nonparametric analogue. My interpretation of your result is that there is a marginally significant difference between the groups at the 0.05 level, and if you accept that, then the Mann-Whitney result indicates that the difference could be attributed to $g_1$ and $g_2$ differing significantly.
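
A minimal sketch of that two-step procedure in R, using the groups from the question (the Holm adjustment here is just one reasonable choice, not something the original analysis used):

response <- c(g1, g2, g3)
group <- factor(rep(c("g1", "g2", "g3"), each = 7))

# Step 1: overall Kruskal-Wallis test
kruskal.test(response ~ group)

# Step 2: pairwise Wilcoxon rank-sum comparisons with a multiplicity adjustment
pairwise.wilcox.test(response, group, p.adjust.method = "holm")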

Don't get hung up on the magic of the 0.05 significance level! Just because the Kruskal-Wallis test gives a p-value slightly over 0.05, don't take that to mean that there is no statistically significant difference between the groups. Also, the fact that the Mann-Whitney test gives a p-value a little below 0.03 for the difference between $g_1$ and $g_2$ does not somehow make the difference between the two groups highly significant. Both p-values are close to 0.05, and a slightly different data set could easily change the Kruskal-Wallis p-value by that much.

Any thought you might have that the results are contradictory would have to come from treating the 0.05 cut-off as a black-and-white boundary with no gray area around it. I think these results are reasonable and quite compatible.
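
To see how fragile a borderline p-value is, a quick hypothetical check (the altered value below is invented purely for illustration; the point is only that a single changed observation can move a p-value this close to 0.05):

# Original data
kruskal.test(list(g1, g2, g3))

# Same test with a single (hypothetical) change to one response in g2
g2_alt <- c(4, 9, 4, 9, 8, 8, 9)
kruskal.test(list(g1, g2_alt, g3))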

Michael R. Chernick
  • You will better communicate your answer when you re-read it for errors (in punctuation, grammar, typography, and spelling) and use effective formatting. Please review the [Markdown help page](http://stats.stackexchange.com/editing-help). – whuber Aug 09 '12 at 21:44
  • The more classic view is that you failed to find statistical significance with your first test, so you should not report (in a professional publication) any further tests as statistically significant indications of between group differences. To do so is to use an alpha other than .05. This is particularly problematic (from the classical view) because you did not choose the higher alpha before conducting the test, so your alpha is unknown. Of course, when you try to understand your data, to guide your own future research program, you can take note of the difference between groups 1 and 2. – Joel W. Aug 09 '12 at 21:53
  • @JoelW. Are you trying to tell me that 0.05091 is really different from 0.05? Anyway my point is not how to report the conclusions but rather to say that the two tests don't conflict. I agree that how you analyze the data should be specified in advance before looking at the data. – Michael R. Chernick Aug 09 '12 at 22:11
  • @whuber Sorry for not editing the post earlier. I hope it looks a lot better now. – Michael R. Chernick Aug 09 '12 at 22:18
  • @JoelW Your 'more classic' view is actually Neyman's 'inductive behaviour' approach to inference. It is relevant to a small subset of the uses of statistics in support of inference. It is most unfortunate that it is presented so often as being classic. – Michael Lew Aug 09 '12 at 22:51
  • @MichaelLew, do you mean there are recommended ways to select your alpha level (other than in advance of calculating your test statistic) in making inferences? If so, what are those ways? – Joel W. Aug 10 '12 at 13:12
  • @MichaelLew, what are the types of "uses of statistics in support of inference" that you are referring to (other than null hypothesis testing)? – Joel W. Aug 10 '12 at 13:14
  • @JoelW The only thing I can imagine that he is referring to would be Bayesian inference, where prior belief in the null hypothesis is combined with observed data, leading to posterior odds favoring the null hypothesis vs alternatives. – Michael R. Chernick Aug 10 '12 at 13:37
  • @MichaelChernick, are you saying, or are you suggesting MichaelLew is saying, that it is ok to select your alpha level after running your (Bayesian) statistical test? – Joel W. Aug 10 '12 at 15:40
  • No, Joel, all I am doing is suggesting that he may be referring to taking a Bayesian approach to hypothesis testing rather than the Neyman-Pearson frequentist approach. Significance levels for tests have nothing to do with Bayesian ideas. – Michael R. Chernick Aug 10 '12 at 16:27
  • @MichaelChernick, but decision rules can be used in Bayesian statistics (e.g., Kruschke's article on [Bayesian data analysis](http://www.indiana.edu/~kruschke/articles/Kruschke2010WIRES.pdf), see the example pp. 10-11). Are you suggesting it is proper to set up these decision rules (akin to significance tests) after the fact, that is, after the Bayesian analysis is run? – Joel W. Aug 10 '12 at 19:31
  • @JoelW. Your link doesn't work. Why are you trying to read so much into what I say? The common Bayesian approach to hypothesis testing is to construct an odds ratio for the null hypothesis. If you like, you can specify that it will be the analysis approach you will take prior to doing the analysis. What decision rule do you think the Bayesian would like to make regarding hypothesis testing? I suppose you could set a threshold on the odds against the null hypothesis and reject if the posterior odds are greater than that specified value. – Michael R. Chernick Aug 10 '12 at 19:42
  • @JoelW. I fixed your broken link. It's easier to use [Markdown](http://stats.stackexchange.com/editing-help) for linking to external references (e.g., `[Markdown](http://stats.stackexchange.com/editing-help)`). – chl Aug 10 '12 at 19:55
  • @MichaelChernick, it just seemed that you may have been treating decision rules as flexible. I wondered how far your flexibility extended. Thanks, chl, for telling me about the Markdown page. How do I find other such pages on this site? – Joel W. Aug 10 '12 at 20:09
  • @JoelW The Kruschke article is a highly opinionated anti-frequentist article. I do not agree with much of what he says. But putting that aside and assuming the Bayesian approach is taken what specific decision rules do you have in mind other than the odds of the null hypothesis being true? – Michael R. Chernick Aug 10 '12 at 20:11
  • @JoelW. I'm not advocating choosing alpha after the experiment. I am advocating not using an alpha at all. Neither did I advocate a Bayesian approach, although if real prior probabilities are available then that would obviously be sensible. The N-P approach is not useful in many circumstances because scientists almost never adhere to the behavioural expectation that allows the N-P approach to deliver a type I error rate equal to or less than alpha. Error rates are rarely interesting to a scientist. Evidence is. – Michael Lew Aug 11 '12 at 00:34
  • Neyman-Pearson hypothesis testing is not the only frequentist framework for statistics. Frequentist ideas relate to the nature of probability as the long-run frequency of events (real and notional). The type I and type II error rate ideas are one way of using that frequentist notion of probability, but they are not necessary or desirable in many cases. See my paper linked in my answer to the original question above for more detail and a good starting set of references. – Michael Lew Aug 11 '12 at 00:40
  • I am perfectly happy with the Neyman-Pearson approach to hypothesis testing. There may be other ways to look at testing in the frequentist framework, but if you view it in terms of rejecting versus not rejecting the null hypothesis of no difference, then to some extent you are trading off between type I and type II error. – Michael R. Chernick Aug 11 '12 at 01:19
  • @MichaelChernick I'm definitely _not_ advocating rejection or non-rejection of the null hypothesis. I think that you should read my paper because you seem to be mixing incompatible models of inference. – Michael Lew Aug 11 '12 at 03:45
  • @MichaelLew I would read your paper if I were interested in your argument, but I am not. Given that, I am not in any position to argue about your approach to hypothesis testing. – Michael R. Chernick Aug 13 '12 at 01:16
12

I agree with Michael Chernick's answer, but think that it can be made a little stronger. Ignore the 0.05 cutoff in most circumstances. It is only relevant to the Neyman-Pearson approach, which is largely irrelevant to the inferential use of statistics in many areas of science.

Both tests indicate that your data contains moderate evidence against the null hypothesis. Consider that evidence in light of whatever you know about the system and the consequences that follow from decisions (or indecision) about the state of the real world. Argue a reasoned case and proceed in a manner that acknowledges the possibility of subsequent re-evaluation.

I explain more in this paper: http://www.ncbi.nlm.nih.gov/pubmed/22394284

[Addendum added Nov 2019: I have a new reference that explains the issues in more detail https://arxiv.org/abs/1910.02042v1 ]

Michael Lew
  • @MichaelChernick I have come to learn from you that there is much more to statistics than just looking for "p<0.05". Michael Lew: I've downloaded your paper and will give it a read for sure. I'll follow your suggestion to reason carefully about my data in this situation. Thank you all! – mljrg Aug 10 '12 at 00:51
  • @MichaelLew I don't share your dim view of the Neyman-Pearson approach to hypothesis testing. I still think it is fundamental to frequentist inference. It is only the strict adherence to the 0.05 level that I object to. – Michael R. Chernick Aug 10 '12 at 01:15
  • @MichaelChernick So, are you saying that one should choose a cutoff for significance prior to the experiment, or that you can choose it after the results are in? The first is OK, but the second is not. The Neyman-Pearson approach deals with error rates, and the type I error rate is only protected when the cutoff for significance is chosen in advance. Thus if you advise someone that a little over 0.05 is close enough because they might have chosen a higher cutoff, then you are not actually using the Neyman-Pearson approach, but an ill-formed hybrid approach, as I explain in the paper linked. – Michael Lew Aug 11 '12 at 00:29
  • People can choose 0.01, 0.05, or 0.10 if they want. This should be done without being influenced by the data. But the choice of 0.01 or 0.05 is not the issue I refer to. It is the black-and-white belief in the significance level, as though 0.049 means statistical significance and 0.0501 does not! – Michael R. Chernick Aug 11 '12 at 00:44
  • Scientists are interested in evidence, but they are not hung up on the methodology used to decide significance. – Michael R. Chernick Aug 11 '12 at 00:45
  • @MichaelChernick If you use the Neyman-Pearson approach, as your comment below the other answer implies, then when alpha is 0.05 a result of P=0.049 _is_ different to a result of P=0.051. That is because the Neyman-Pearson approach eschews the concept of evidential meaning of data in favour of error rates. You cannot use Neyman-Pearson and talk about evidence. If you won't read my paper, then perhaps you can read this one by Steve Goodman: – Michael Lew Aug 11 '12 at 03:50
  • @MichaelLew No, the Neyman-Pearson approach has nothing to do with people being overly strict about alpha. Yes, it does say to fix alpha, but the interpretation of 0.049 vs 0.0501 is not the fault of the method. – Michael R. Chernick Aug 11 '12 at 04:17
5

Results of the Kruskal-Wallis test and the Mann-Whitney U test may differ because:

  • The ranks used for the Mann-Whitney U test are not the ranks used by the Kruskal-Wallis test; and
  • The rank sum tests do not use the pooled variance implied by the Kruskal-Wallis null hypothesis.

Hence, the Mann-Whitney U test is not recommended as a post hoc test after a Kruskal-Wallis test.
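
The first of the two points above is easy to verify directly in R with the groups from the question:

# Ranks used by the Kruskal-Wallis test: all 21 observations ranked together
rank(c(g1, g2, g3))

# Ranks used by a Mann-Whitney test of g1 vs g2: only those 14 observations
rank(c(g1, g2))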

Other tests, such as Dunn's test (commonly used), the Conover-Iman test, and the Dwass-Steel-Critchlow-Fligner test, can be used as post hoc tests for the Kruskal-Wallis test.
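
As an illustration, a sketch of one such post hoc workflow using the dunn.test package (one of several R packages that implement Dunn's test; the Bonferroni adjustment is just an example choice):

library(dunn.test)

response <- c(g1, g2, g3)
group <- rep(c("g1", "g2", "g3"), each = 7)

# Dunn's test of all pairwise comparisons, with a multiple-comparison adjustment
dunn.test(response, group, method = "bonferroni")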

Dr Nisha Arora
3

This is in answer to @vinesh, as well as a look at the general principle in the original question.

There are really 2 issues here with multiple comparisons: as we increase the number of comparisons being made, we have more information, which makes it easier to see real differences; but the increased number of comparisons also makes it easier to see differences that don't exist (false positives, data dredging, torturing the data until it confesses).

Think of a class with 100 students, each of whom is given a fair coin and told to flip it 10 times and use the results to test the null hypothesis that the proportion of heads is 50%. We would expect the p-values to range between 0 and 1, and just by chance we would expect around 5 of the students to get p-values less than 0.05. In fact, we would be very surprised if none of them obtained a p-value less than 0.05 (less than a 1% chance of that happening). If we only look at the few significant values and ignore all the others, then we will falsely conclude that the coins are biased; but if we use a technique that takes the multiple comparisons into account, then we will likely still judge correctly that the coins are fair (or at least fail to reject that they are fair).
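
A quick simulation sketch of that classroom example (the exact binomial test and the Holm adjustment are my assumed choices, not anything prescribed above):

set.seed(123)  # arbitrary seed, for reproducibility only
p_values <- replicate(100, binom.test(rbinom(1, size = 10, prob = 0.5),
                                      n = 10, p = 0.5)$p.value)
sum(p_values < 0.05)          # a few "significant" results appear by chance alone
sum(p.adjust(p_values, method = "holm") < 0.05)  # typically none after adjustment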

On the other hand, consider a similar case where we have 10 students rolling a die and recording whether the value is in the set {1,2,3} or the set {4,5,6}, each of which has a 50% chance on each roll if the die is fair (but could differ if the die is rigged). All 10 students compute p-values (the null being 50%) and get values between 0.06 and 0.25. In this case none of them reached the magic 5% cut-off, so looking at any individual student's results will not lead to a declaration of unfairness; but all the p-values are less than 0.5, and if all the dice were fair the p-values should be uniformly distributed, each with a 50% chance of being above 0.5. The chance of getting 10 independent p-values all less than 0.5 when the nulls are true is less than the magic 0.05, and this suggests that the dice are biased; we just did not have enough power to detect this in the individual trials, but pooling the information shows the null is false.
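
The arithmetic behind that last step, as a quick check:

# Probability that all 10 independent p-values fall below 0.5 when every null is true
0.5^10   # = 0.0009765625, well below 0.05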

Now coin flipping and die rolling are a bit contrived, so here is a different example: I have a new drug that I want to test. My budget allows me to test the drug on 1,000 subjects (this will be a paired comparison with each subject being their own control). I am considering 2 different study designs: in the first, I recruit 1,000 subjects, do the study, and report a single p-value. In the second, I recruit 1,000 subjects but break them into 100 groups of 10 each, do the study on each of the 100 groups, and compute a p-value for each group (100 p-values in total). Think about the potential differences between the 2 methodologies and how the conclusions could differ. An objective approach would require that both study designs lead to the same conclusion (given the same 1,000 patients and everything else being the same).

@mljrg, why did you choose to compare g1 and g2? If this was a question of interest before collecting any data, then the MW p-value is reasonable and meaningful. However, if you ran the KW test first, looked to see which 2 groups were the most different, and then ran the MW test only on those, the assumptions of the MW test were violated: the MW p-value is meaningless, and the KW p-value is the only one with potential meaning.

Greg Snow