
Some authors (e.g. Pallant, 2007, p. 225; see image below) suggest calculating the effect size for a Wilcoxon signed rank test by dividing the test statistic by the square root of the number of observations:

$r = \frac{Z}{\sqrt{n_x + n_y}}$

$Z$ is the test statistic output by SPSS (see image below) as well as by `wilcoxsign_test` in R. (See also my related question: teststatistic vs linearstatistic in `wilcoxsign_test`.)

Others suggest the Bravais-Pearson ($r = \frac{\operatorname{cov}(X,Y)}{\operatorname{sd}(X)\,\operatorname{sd}(Y)}$) or Spearman ($r_S$) correlation coefficients (depending on data type).

When you calculate them, the two $r$s are not even remotely the same. E.g., for my current data:

$r = 0.23$ (for $r = \frac{Z}{\sqrt{n_x + n_y}}$)

$r = 0.43$ (Pearson)

These would imply quite different effect sizes.

So which is the correct effect size to use, and how do the two $r$s relate to each other?
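For concreteness, here is a minimal sketch in R of how each $r$ can be computed; the before/after vectors are made-up stand-ins for my real data, and `wilcoxsign_test` comes from the `coin` package:

```r
library(coin)  # wilcoxsign_test can handle ties, which is why I use it

# Hypothetical paired scores, standing in for the real data
before <- c(5, 7, 6, 8, 4, 6, 7, 5, 6, 8)
after  <- c(4, 6, 6, 7, 4, 5, 6, 5, 5, 7)

wt <- wilcoxsign_test(after ~ before)
z  <- as.numeric(statistic(wt))  # the standardized Z that SPSS also reports

n_pairs <- length(before)  # number of participants (pairs)
n_obs   <- 2 * n_pairs     # n_x + n_y, Pallant's reading of "n"

r_z <- abs(z) / sqrt(n_obs)  # Pallant (2007): r = Z / sqrt(n_x + n_y)

r_pearson  <- cor(before, after, method = "pearson")   # Bravais-Pearson
r_spearman <- cor(before, after, method = "spearman")  # Spearman
```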


Pages 224 (bottom part) and 225 from Pallant, J. (2007). *SPSS Survival Manual*:

[scanned book pages, not reproduced; they instruct computing the effect size as $r = z/\sqrt{n}$, where $n$ is the total number of observations]

  • Bravais-Pearson is a new one on me. I take it this is another case of Pearson getting credit when someone else was there first? – Glen_b Jan 12 '15 at 07:18
  • Ah, yes, [looks like maybe that's it](http://translate.google.com/translate?hl=en&sl=de&u=http://de.wikipedia.org/wiki/Korrelationskoeffizient). – Glen_b Jan 12 '15 at 07:20
  • @Glen_b Yes, that's it. I'm sorry, I always find it difficult and confusing when I have to translate statistical terminology into English. Please edit the question if you know the proper term(s). –  Jan 12 '15 at 07:24
  • I'd much rather leave it as it is - if Bravais deserves credit in one language, he deserves it in another! I appreciate the filling of a gap in my education. – Glen_b Jan 12 '15 at 07:27
  • lol I added the formula to make it clear what I mean. –  Jan 12 '15 at 07:28
  • Who says which and what justification do they offer? Who calls the signed rank statistic $Z$? (or is that a standardized signed rank statistic?). In what sense are they an effect size? – Glen_b Jan 12 '15 at 09:01
  • As for Z, that is what R and SPSS output. See also my other question: http://stackoverflow.com/questions/27896655/teststatistic-vs-linearstatistic-in-wilcoxsign-test That it can be used to calculate effect sizes is stated, for example, in Pallant, J. (2007). *SPSS Survival Manual*. p. 225. –  Jan 12 '15 at 09:47
  • Ah, I see from your linked question you *don't* mean R's signed rank test, you mean one in the package `coin`. – Glen_b Jan 12 '15 at 09:58
  • Yes, because I need a test that can handle ties. –  Jan 12 '15 at 10:16
  • Why not compute the effect size as $z/\sqrt{n}$, analogously to the paired-sample t-test's effect size $t/\sqrt{n}$? – ttnphns Jan 12 '15 at 12:04
  • The instruction in the book I quote in my comment above (Pallant, 2007, p. 225) says that the `n` in $\sqrt{n}$ is the number of all observations, that is, the sum of the lengths of both vectors, i.e. $n = n_x + n_y$, not the number of participants. So the formula is the same; you only have to correctly understand what "n" stands for. If that is wrong, please educate me. That is, after all, what my question is aiming at. –  Jan 12 '15 at 12:12
  • @ttnphns See the image I attached to my question. –  Jan 12 '15 at 12:41
  • Very strange. Why do they state that the effect size may be calculated exactly as for the independent-samples test (Mann-Whitney)? That looks incorrect to me. – ttnphns Jan 12 '15 at 13:22
  • $X$, $Y$ and $Z$ reflect only the ranks. The ranks, however, are "artificial": you interpret the statistics in terms of the observations, not the ranks. Therefore power calculations or CIs in terms of some location model that translates the "natural" effect size into the rank-statistic world make sense. So I'm not sure the procedures in this question are actually useful. – Horst Grünbusch Jan 12 '15 at 14:10
  • @HorstGrünbusch I use the Hodges-Lehmann estimator to calculate effect size, but want to report a more traditional measure alongside it, such as Spearman's correlation (the data are ordinal and the distribution unknown but not normal). I stumbled upon the first formula and just want to understand it. As you can see from my other question, I don't even understand what that Z is. –  Jan 12 '15 at 17:43
  • The Hodges-Lehmann pseudomedian isn't a standardized measure. An effect size by definition must be a standardized measure. – ttnphns Jan 12 '15 at 21:23
  • Then what *is* an appropriate-to-ordinal-nonnormal-data standardized measure, and why (source)? –  Jan 13 '15 at 06:04
  • I personally thought that $Z/\sqrt{n}$ might be one option. Wikipedia on Mann-Whitney links to a PDF paper by Kirby which considers the paired Wilcoxon as well; I haven't read the article myself. – ttnphns Jan 13 '15 at 08:57
  • When Wilcoxon is a paired test, there is only one $n$. When doing Wilcoxon-Mann-Whitney, there are two independent samples with different $n$'s. – Carl Sep 19 '16 at 21:27
  • @Carl What do you mean by one $n$? Do you mean the formula is $Z/\sqrt{n+n}$? – RockTheStar Oct 11 '16 at 20:21
  • @what Can you provide references in which the authors suggest using the Bravais-Pearson or Spearman coefficient for effect size? Thanks. – RockTheStar Oct 11 '16 at 20:31
  • @RockTheStar I'm sorry, this question is one and a half years old – I don't remember what I was reading back then. –  Oct 11 '16 at 20:40
  • @RockTheStar The images above are for the [Wilcoxon signed rank test](https://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test), and the question preceding looks like some variation of the [Wilcoxon rank sum test](https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#.CF.81_statistic), AKA Mann-Whitney test, $\rho$ statistic. – Carl Oct 11 '16 at 21:21
  • @RockTheStar Look here: https://en.wikipedia.org/wiki/Effect_size and here: http://stats.stackexchange.com/questions/15749/how-to-report-effect-size-measures-r-and-r-squared-and-what-is-a-non-technical-e Both discuss the correlation coefficient as a measure of effect size. –  Oct 12 '16 at 04:54
  • @what Interesting. Hmm... so what's the default medium effect size for the Wilcoxon? I have seen 0.3 or 0.5. – RockTheStar Oct 12 '16 at 18:06

2 Answers

  • If you don't have ties, I would report the proportion of after values that are less than the corresponding before values.
  • If you do have ties, you could report the proportion of after values that are less than before out of the total number of non-tied pairs, or report all three proportions (<, =, >) and perhaps the sum of whichever two were more meaningful. For example, you could say '33% had less fear of statistics, 57% were unchanged, and 10% had more fear after the course such that 90% were the same as or better than before'.

Generally speaking, a hypothesis test will output a p-value that can be used to make a decision about whether or not to reject the null hypothesis while controlling for the type I error rate. The p-value, however, conflates the size of the effect with our amount of clarity that it is inconsistent with the null (in essence, how much data the test had access to). An effect size generally tries to extract the $N$ so as to isolate the magnitude of the effect. That line of reasoning illuminates the rationale behind dividing $z$ by $\sqrt N$. However, a major consideration with effect size measures is interpretability. Most commonly that consideration plays out in choosing between a raw effect size or a standardized effect size. (I suppose we could call $z/\sqrt N$ a standardized effect size, for what that's worth.) At any rate, my guess is that reporting $z/\sqrt N$ won't give people a quick, straightforward intuition into your effect.

There is another wrinkle, though. While you want an estimate of the size of the overall effect, people typically use the Wilcoxon signed rank test with data that are only ordinal. That is, where they don't trust that the data can reliably indicate the magnitude of the shift within a student, but only that a shift occurred. That brings me to the proportion improved discussed above.
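A minimal sketch of computing those proportions, with hypothetical before/after vectors standing in for your data:

```r
# Hypothetical paired scores; substitute your own vectors
before <- c(5, 7, 6, 8, 4, 6, 7, 5, 6, 8)
after  <- c(4, 6, 6, 7, 4, 5, 6, 5, 5, 7)
d      <- after - before

p_improved  <- mean(d < 0)   # proportion with lower (better) scores after
p_unchanged <- mean(d == 0)  # proportion tied
p_worse     <- mean(d > 0)   # proportion with higher scores after

# proportion improved among the non-tied pairs only
p_improved_nontied <- sum(d < 0) / sum(d != 0)
```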


On the other hand, if you do trust that the values are intrinsically meaningful (e.g., you only used the signed rank test for its robustness to non-normality and outliers), you could just use a raw mean or median difference, or the standardized mean difference as a measure of effect.
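For instance, a minimal sketch of those raw and standardized alternatives, again with hypothetical vectors (note that conventions for a paired standardized mean difference vary; this version uses the SD of the differences):

```r
before <- c(5, 7, 6, 8, 4, 6, 7, 5, 6, 8)  # hypothetical paired scores
after  <- c(4, 6, 6, 7, 4, 5, 6, 5, 5, 7)
d      <- after - before

mean(d)    # raw mean difference
median(d)  # raw median difference

# One common standardized mean difference for paired data ("Cohen's d_z"):
# the mean of the differences divided by the SD of the differences
mean(d) / sd(d)
```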

gung - Reinstate Monica
  • +1 Your proposed effect measures are easily understood and also related to the test statistic. – John Dec 11 '16 at 09:04

Without knowing what kind of data were being assessed, it's very hard to give good advice here. And really, that's all you can get: there's just no such thing as a best measure of effect size for questions like this... maybe ever.

The effect sizes mentioned in the question are all standardized effect sizes. But it's entirely possible that the means or medians of the original measures are just fine. For example, if you're measuring how long it takes for a manufacturing process to complete, then the difference in times should be a perfectly reasonable effect size. Any changes in process, future measurements, measurements across systems, and measurements across factories will all be in units of time. Maybe you want the mean, maybe you want the median, or even the mode, but the first thing you need to do is look at the actual measurement scale and see whether an effect size on that scale is reasonable to interpret and strongly connected to the measure.

To assist in thinking about that: the effects that should be standardized are those that are measured more indirectly and in many different ways. For example, psychological scales can vary over time and in many ways, and they attempt to get at an underlying variable that is not being directly assessed. In those cases you want standardized effect sizes.

With standardized effect sizes the critical issue is not just which to use but what they mean. As you imply in your question, you don't know what they mean, and that's the critical thing. If you don't know what the standardized effect is, then you can't report it correctly, interpret it correctly, or use it correctly. Further, if there are a variety of ways you'd like to discuss the data, there is absolutely nothing stopping you from reporting more than one effect size. You can discuss your data in terms of a linear relationship, as with the product-moment correlation; in terms of the relationship between the ranks, with Spearman's $r$, and the differences between those; or just provide all the information in a table. There's nothing wrong with that at all. But more than anything, you're going to have to decide what you want your results to mean. That's something that can't be answered from the information given, and it might require far more information and domain-specific knowledge than is reasonable for a question in this kind of forum.

And always think meta-analytically about how you're reporting effects. Will people in the future be able to take the results you're reporting and integrate them with others? Perhaps there's a standard in your field for these things. Perhaps you selected a non-parametric test primarily because you don't trust the conclusions others have made about the underlying distributions, and you want to be more conservative in your assumptions in a field that primarily uses parametric tests. In that case there's nothing wrong with additionally providing an effect size typically used with parametric tests. These and many other issues need to be considered when thinking about how you place your finding in a larger literature of similar research. Typically, good descriptive stats solve these problems.

So that's the primary advice. I have a few additional comments. If you want your effect size to be strongly related to the test you performed, then the $Z$-based recommendation is obviously best: your standardized effect size will mean the same thing as the test. But as soon as you're not doing that, there's nothing wrong with using almost anything else, even something like Cohen's d that's associated with parametric tests. There is no normality assumption for calculating means, standard deviations, or d scores; in fact, the assumptions are weaker than for the recommended correlation coefficient. And always report good descriptive measures. Again, descriptive measures have no assumptions you'd be violating, but keep their substantive meaning in mind: report descriptive stats that say what you want to say about your data, and remember that means and medians say different things.
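As a minimal sketch of that kind of descriptive reporting, assuming hypothetical paired vectors:

```r
before <- c(5, 7, 6, 8, 4, 6, 7, 5, 6, 8)  # hypothetical paired scores
after  <- c(4, 6, 6, 7, 4, 5, 6, 5, 5, 7)

summary(after - before)  # quartiles, median, and mean of the paired differences
c(median_before = median(before), median_after = median(after))
```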

If you want to discuss repeated measures versus independent design effect sizes then that's really a whole new question.

John