9

I'm confused by the following, and I haven't been able to dig up the answer elsewhere.

I'm trying to learn R while doing some statistics, and, as an exercise, I try to double-check the results of the built-in R functions by also doing these 'by hand', as it were, in R. However, for the Kruskal-Wallis test I keep getting different results, and I can't figure out why.

For example, I'm looking at the following data handed out in an exercise

activity <- c(2, 4, 3, 2, 3, 3, 4, 0, 4, 3, 4, 0, 0, 1, 3, 1, 2, 0, 3, 1, 0, 3, 4, 0, 1, 2, 2, 2, 3, 2) 
group <- c(rep("A", 11), rep("B", 10), rep("C", 9))
group <- factor(group)
data.raw <- data.frame(activity, group)

And I want to analyse activity by group. First I run a Kruskal-Wallis test using the in-built R function

kruskal.test(activity ~ group, data = data.raw)

Which returns $H = 8.9056$.

To double-check, I try doing the same 'by hand' in R, with the following (no doubt helpless) code

rank <- rank(activity)
data.rank <- data.frame(rank, group)
rank.sum <- aggregate(rank ~ group, data = data.rank, sum)

x <- rank.sum[1,2]^2 / 11 + rank.sum[2,2]^2 / 10 + rank.sum[3,2]^2 / 9
H <- (12 / (length(activity) * (length(activity) + 1))) * x - 3 * (length(activity) + 1)
H

Which is meant to reflect the following formula:

$$ H =\frac{12}{N(N+1)}\sum_{i = 1}^g \left(\frac{R^2_i}{n_i} \right) - 3(N + 1)$$

Where $N$ is the total number of observations, $g$ is the number of groups, $n_i$ is the number of observations in the $i$th group, and $R_i$ is the sum of ranks of the $i$th group.

And now I get $H = 8.499$, which, adding to my confusion, is also the answer given for the exercise in question. I've tried this for a couple of different data sets, and I tend to get a slightly higher value for $H$ using the in-built function.

I've tried searching to figure out what I'm doing wrong or failing to understand, but to no avail. Can anyone help me understand why the inbuilt kruskal.test function returns a different value from the one I get by spelling things out?

ttnphns
  • 51,648
  • 40
  • 253
  • 462
MSR
  • 93
  • 5

1 Answers1

12

kruskal.test applies a correction for ties as described in this Wikipedia article (point 4):

A correction for ties if using the short-cut formula described in the previous point can be made by dividing H by $1 - \frac{\sum_{i=1}^G (t_i^3 - t_i)}{N^3-N}$, ...

Continuing from your code:

TIES <- table(activity)
H / (1 - sum(TIES^3 - TIES)/(length(activity)^3 - length(activity)))
#[1] 8.9056

You can find out what the R function does by carefully studying the code, which you can see using getAnywhere(kruskal.test.default).

Roland
  • 5,758
  • 1
  • 28
  • 60
  • Lovely, thank you very much! As a follow-up, if I may: Is there an easy way to ask `kruskal.test` not to apply this correction for ties? – MSR Jan 09 '17 at 14:56
  • This should have been answered by StackOverflow instead of here. It is only about the software. – Michael R. Chernick Jan 09 '17 at 15:18
  • @MSR No, that's not possible. Why would you want that as it would be a less correct approximation? – Roland Jan 09 '17 at 15:20
  • 4
    @MichaelChernick No, it's not. The point is that OP has been taught a simplification of the test that should be used only if there are no ties. – Roland Jan 09 '17 at 15:21
  • This is a difficult call. There is a little statistical content to question and your answer is certainly helpful. It ws obviously satisfactory to the OP and I will upvote it. My point is very often such questions are closed or migrated before any answer are given and that does follow the site policy. – Michael R. Chernick Jan 09 '17 at 15:42
  • 4
    @MichaelChernick I'm not saying it wouldn't fit at Stack Overflow. But I'd argue that it fits equally well at CV. Obviously, it would have been helpful if OP had not only shared their code but also the formulas they are using. – Roland Jan 09 '17 at 15:45
  • 3
    @Michael The status of this thread is an easy call: it's squarely within our purview because it seeks to understand a statistical test. – whuber Jan 09 '17 at 16:16
  • @Roland: The follow-up was largely curiosity, but also I figured it might come in handy for exercises where answers are given without correction, as a quick double-check that everything's in order. – MSR Jan 09 '17 at 18:04
  • 2
    Edited to include the formula reflected in the code. Should've thought to do so the first time around. Apologies. – MSR Jan 09 '17 at 18:31
  • 3
    See also the R `Hmisc` package `spearman2` function which uses midranks for ties and an `F` test to get Kruskal-Wallis. I think this is more accurate than some methods. – Frank Harrell Jan 09 '17 at 20:28