
There is some demand for effect-size measures in post hoc analyses, and I am trying to decide whether to provide for this in an R package of mine. If I do, I would concentrate on Cohen's $d$-style measures. In the context of comparing the $i$th and $j$th means in a simple one-way experiment, such a measure is defined as $$ d_{ij} = \frac{\mu_i - \mu_j}{\sigma} $$ Here, $\mu_i$ and $\mu_j$ are the treatment means, and $\sigma$ is the error standard deviation of the data, assumed common to all treatments. I can figure out how to estimate $d_{ij}$ from observed data and construct a confidence interval for it that takes into account the uncertainty in the estimates of all three parameters.
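
As a minimal sketch of the point estimate only (not the confidence interval), the definition above can be computed by plugging in sample means and the pooled SD. The code is Python rather than R purely for illustration; the data values and the helper names `pooled_sd` and `cohens_d` are hypothetical.

```python
import math
from statistics import mean, variance

def pooled_sd(groups):
    """Pooled SD: within-group sums of squares divided by the error df."""
    ss = sum((len(g) - 1) * variance(g) for g in groups)
    df = sum(len(g) - 1 for g in groups)
    return math.sqrt(ss / df)

def cohens_d(groups, i, j):
    """Plug-in estimate of d_ij = (mean_i - mean_j) / sigma (0-based i, j)."""
    return (mean(groups[i]) - mean(groups[j])) / pooled_sd(groups)

# Hypothetical one-way data: three treatments, four observations each.
groups = [
    [10.0, 12.0, 11.0, 13.0],
    [14.0, 15.0, 13.0, 16.0],
    [9.0, 8.0, 10.0, 9.0],
]
d_01 = cohens_d(groups, 0, 1)  # first treatment versus second
```

Note that all three groups contribute to the pooled SD, even when only two means are compared.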

However, my questions have to do with extending these ideas to split-plot experiments. There is some discussion of this here, but that focuses on pre-post testing rather than a more general case. For concreteness, I'll focus on Yates's classic oat-yield experiment (data available as Oats in the nlme package in R). This experiment has six blocks (factor Block); each block is subdivided into three plots, which are randomly assigned to the three levels of factor Variety; and each plot is divided into four subplots, which are randomly assigned to the four levels of factor nitro. I have questions about two ways that these data could be analyzed.

Homogeneous mixed model

I think this is the easy question... In this model, we assume that the Block effects are iid $N(0,\sigma_B^2)$, the whole-plot effects (identified by Block:Variety) are iid $N(0,\sigma_P^2)$, and the residual (subplot) effects are iid $N(0,\sigma_E^2)$. An additive model with fixed effects for Variety and nitro fits pretty well.

What I surmise is that we can estimate Cohen's $d$ values for either Variety comparisons or nitro comparisons as the observed comparison, divided by an estimate of the total SD, $\sigma_T = \sqrt{\sigma_B^2 + \sigma_P^2 + \sigma_E^2}$. Is that correct? Or do people use some other $\sigma$ as the reference (other than perhaps modeling blocks and/or plots as fixed effects)?
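
To make the proposal concrete, here is a small sketch of the arithmetic, assuming the three variance components have already been estimated from a fitted mixed model. The numbers are hypothetical round values, not estimates from the Oats data, and the function names are made up for illustration.

```python
import math

def total_sd(var_block, var_plot, var_resid):
    """sigma_T = sqrt(sigma_B^2 + sigma_P^2 + sigma_E^2)."""
    return math.sqrt(var_block + var_plot + var_resid)

def d_total(diff, var_block, var_plot, var_resid):
    """A comparison of means divided by the total SD."""
    return diff / total_sd(var_block, var_plot, var_resid)

# Hypothetical variance components (NOT taken from the Oats data).
vc = dict(var_block=200.0, var_plot=100.0, var_resid=100.0)

sigma_T = total_sd(**vc)     # sqrt(400) = 20
d_est = d_total(10.0, **vc)  # a 10-unit difference gives d = 0.5
```

Using only the residual component instead would give a larger $d$ for the same difference (here $10/10 = 1$ rather than $0.5$), which is why the choice of reference $\sigma$ matters.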

Multivariate model

Another possibility is to model the four observations on each plot (corresponding to the 4 levels of nitro) as a multivariate response variable. In that case, the Block effects, if modeled as random, have a multivariate distribution, and so do the errors. The plot effects are subsumed in these multivariate distributions. We can form meaningful comparisons of marginal means for Variety and nitro, as well as for combinations thereof, inasmuch as this model implicitly contains interaction effects.

But my question is, what, if anything, is a sensible reference value $\sigma$ for defining Cohen's $d$-style effect sizes? Each level of nitro has its own error variance. I can see that for certain interaction comparisons (comparing Variety with nitro held fixed at one level), this is clear-cut; but it is not at all clear to me what, if anything, defines a Cohen's $d$ for comparing levels of nitro, either marginally or at a fixed Variety. Can anybody enlighten me?

Russ Lenth

1 Answer


This is not a complete answer by any means, but I did find, from an entirely separate thread, this blog posting by Jake Westfall. I'll summarize a few key points made there (or at least my interpretation of them).

  1. "Classic" Cohen's $d$ is defined in terms of $\sigma$ being the pooled SD in a very simple one-way analysis. Westfall argues that if you're going to call something "Cohen's d", you should probably fit this simple model and estimate $\sigma$ accordingly.

  2. Later, he suggests that "classic" Cohen's $d$ has a lot to recommend it, inasmuch as these effect sizes are often used to compare effects from different studies, and this provides a unified basis for doing so.

  3. Westfall's discussion of $d_r$ (based on model residual variation) comes close to corroborating my comment on using the total error SD $\sigma_T$ in the homogeneous mixed model.

  4. For the multivariate model, I surmise based on (1) and (2) that one might estimate $\sigma$ via $\sqrt{\mbox{avg}(s^2)}$ (that is, average the variances of the multivariate responses), as this is comparable to the pooled SD.

  5. Near the end, there is discussion of criticisms of standardized effect sizes that have been expressed by various authors (including Tukey). I thought I agreed with much of what was said, until I got to the very end, where he says they are still useful for power analysis. That threw me, because I have actually written about that being inadvisable: using a standardized effect size as the target value for a sample-size calculation gives you the same $n$, regardless of how good or bad your instrumentation or your design.
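
Points 4 and 5 both lend themselves to a quick numerical check. The sketch below (Python, standard library only) first computes $\sqrt{\mbox{avg}(s^2)}$ from a set of per-response variances. It then uses the usual normal-approximation sample-size formula for a two-sided, two-group comparison, $n = 2\,(z_{1-\alpha/2} + z_{\text{power}})^2 \sigma^2 / \delta^2$, to show what happens when the target is stated as a standardized $d$. That formula is a simplifying assumption chosen for illustration; the variances, $\alpha$, power, and helper names are all hypothetical.

```python
import math
from statistics import NormalDist

def avg_variance_sd(variances):
    """sqrt(avg(s^2)): reference SD from the average of several variances."""
    return math.sqrt(sum(variances) / len(variances))

def n_per_group(delta, sigma, alpha=0.05, power=0.80):
    """Normal-approximation n per group to detect a raw difference delta."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil(2 * (z * sigma / delta) ** 2)

# Point 4: hypothetical variances, one per response (e.g. per nitro level).
sigma_ref = avg_variance_sd([90.0, 110.0, 95.0, 105.0])  # sqrt(100) = 10

sigmas = (5.0, 10.0, 20.0)  # good, middling, and poor instrumentation

# Point 5: with the target fixed as d = 0.5, the raw target is d * sigma,
# so sigma cancels and n comes out the same for every sigma.
d_target = 0.5
n_by_sigma = {s: n_per_group(d_target * s, s) for s in sigmas}

# With the target fixed on the raw scale, noisier measurements need more units.
n_raw = {s: n_per_group(5.0, s) for s in sigmas}
```

In `n_by_sigma` every entry is identical, whereas `n_raw` grows with $\sigma$, which is the sense in which a standardized target ignores the quality of the instrumentation or design.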

All that said, I still regard this question as unresolved. However, for purposes of my R package, I am resolved to leave the specification of $\sigma$ wide open and let the user (not me) bear as much responsibility as possible for deciding what to specify.

Russ Lenth