Are degrees of freedom in lmerTest::anova correct? They are very different from RM-ANOVA

Question

I am analyzing the results of a reaction time experiment in R.

I ran a repeated measures ANOVA (1 within-subject factor with 2 levels and 1 between-subject factor with 2 levels). I ran a similar linear mixed model and I wanted to summarize the lmer results in the form of ANOVA table using lmerTest::anova.

Don't get me wrong: I did not expect identical results, but I am not sure about the degrees of freedom in the lmerTest::anova results. ~~It seems to me it rather reflects an ANOVA with no aggregation on subject-level.~~

I am aware of the fact that calculating degrees of freedom in mixed-effect models is tricky, but lmerTest::anova is mentioned as one possible solution in the updated ?pvalues topic (lme4 package).

Is this calculation correct? Do the results of lmerTest::anova correctly reflect the specified model?

Update: I made the individual differences larger. The degrees of freedom in lmerTest::anova now differ more from those of the simple anova, but I am still not sure why they are so large for the within-subject factor and the interaction.

# mini example with ANT dataset from ez package
library(ez); library(lme4); library(lmerTest)

# repeated measures ANOVA with ez package
data(ANT)
ANT.2 <- subset(ANT, !error)
# update: make individual differences larger
baseline.shift <- rnorm(length(unique(ANT.2$subnum)), 0, 50)
ANT.2$rt <- ANT.2$rt + baseline.shift[as.numeric(ANT.2$subnum)]

anova.ez <- ezANOVA(data = ANT.2, dv = .(rt), wid = .(subnum), 
  within = .(direction), between = .(group))
anova.ez

# similarly with lmer and lmerTest::anova
model <- lmer(rt ~ group * direction + (1 | subnum), data = ANT.2)
lmerTest::anova(model)

# simple ANOVA on all available data
m <- lm(rt ~ group * direction, data = ANT.2)
anova(m)

Results of the code above [updated]:

anova.ez

$ANOVA

           Effect DFn DFd         F          p p<.05          ges
2           group   1  18 2.6854464 0.11862957       0.1294475137
3       direction   1  18 0.9160571 0.35119193       0.0001690471
4 group:direction   1  18 4.9169156 0.03970473     * 0.0009066868

lmerTest::anova(model)

Analysis of Variance Table of type 3  with  Satterthwaite 
approximation for degrees of freedom
                Df Sum Sq Mean Sq F value Denom Pr(>F)
group            1  13293   13293  2.6830    18 0.1188
direction        1   1946    1946  0.3935  5169 0.5305
group:direction  1  11563   11563  2.3321  5169 0.1268

anova(m)

Analysis of Variance Table

Response: rt
                  Df   Sum Sq Mean Sq  F value Pr(>F)    
group              1  1791568 1791568 242.3094 <2e-16 ***
direction          1      728     728   0.0985 0.7537    
group:direction    1    12024   12024   1.6262 0.2023    
Residuals       5187 38351225    7394                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

score 13 · Accepted Answer · edited May 05 '18 at 13:46

I think that lmerTest is getting it right and ezANOVA is getting it wrong in this case.

The results from lmerTest agree with my intuition/understanding.
Two different computations in lmerTest (Satterthwaite and Kenward-Roger) agree.
They also agree with nlme::lme.
When I run it, ezANOVA gives a warning, which I don't entirely understand, but which should not be disregarded ...

Re-running example:

library(ez); library(lmerTest); library(nlme)
data(ANT)
ANT.2 <- subset(ANT, !error)
set.seed(101)  ## for reproducibility
baseline.shift <- rnorm(length(unique(ANT.2$subnum)), 0, 50)
ANT.2$rt <- ANT.2$rt + baseline.shift[as.numeric(ANT.2$subnum)]

Figure out experimental design

with(ANT.2,table(subnum,group,direction))

So it looks like individuals (subnum) are placed in either control or treatment groups, and each is tested for both directions -- i.e. direction (and with it the group:direction interaction) can be tested within individuals (denominator df is large), but group can only be tested among individuals.

(anova.ez <- ezANOVA(data = ANT.2, dv = .(rt), wid = .(subnum), 
    within = .(direction), between = .(group)))
## $ANOVA
##            Effect DFn DFd         F          p p<.05          ges
## 2           group   1  18 2.4290721 0.13651174       0.1183150147
## 3       direction   1  18 0.9160571 0.35119193       0.0002852171
## 4 group:direction   1  18 4.9169156 0.03970473     * 0.0015289914

Here I get Warning: collapsing data to cell means. *IF* the requested effects are a subset of the full design, you must use the "within_full" argument, else results may be inaccurate. The denominator DF look a little funky (all equal to 18): I think they should be larger for direction and group:direction, which can be tested independently (but would be smaller if you added (direction|subnum) to the model)?
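The aggregation that the warning mentions can be illustrated on simulated data. This is only a sketch: the data frame d, its effect sizes and the number of trials are made up (it is not the ANT data, and it is not ezANOVA's internal code). It collapses trial-level data to one mean per subject per direction and runs a classical aov() with an Error() term on the 40 cell means, which is where a denominator df of 18 for every term comes from.

```r
# Simulated stand-in for the design: 20 subjects nested in 2 groups,
# each subject tested in both directions over several trials (values are made up).
set.seed(1)
n_subj <- 20
n_trials <- 10
d <- expand.grid(trial = seq_len(n_trials),
                 direction = factor(c("left", "right")),
                 subnum = factor(seq_len(n_subj)))
d$group <- factor(ifelse(as.integer(d$subnum) <= n_subj / 2,
                         "Control", "Treatment"))
subj_shift <- rnorm(n_subj, mean = 0, sd = 50)   # between-subject variation
d$rt <- 500 + subj_shift[as.integer(d$subnum)] +
  rnorm(nrow(d), mean = 0, sd = 70)              # trial-level noise

# Collapse to cell means: one row per subject x direction (20 * 2 = 40 rows)
cell <- aggregate(rt ~ subnum + group + direction, data = d, FUN = mean)

# Classical repeated-measures ANOVA on the cell means
rm_fit <- aov(rt ~ group * direction + Error(subnum / direction), data = cell)
summary(rm_fit)

# Residual df of the between-subject and within-subject error strata
df_between <- rm_fit[["subnum"]]$df.residual
df_within  <- rm_fit[["subnum:direction"]]$df.residual
c(between = df_between, within = df_within)
```

After aggregation the number of trials per cell no longer enters the df: the within-subject terms are tested against the subject-by-direction stratum, not against trial-level residuals.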

# similarly with lmer and lmerTest::anova
model <- lmer(rt ~ group * direction + (1 | subnum), data = ANT.2)
lmerTest::anova(model)
##                 Df  Sum Sq Mean Sq F value Denom Pr(>F)
## group            1 12065.7 12065.7  2.4310    18 0.1364
## direction        1  1952.2  1952.2  0.3948  5169 0.5298
## group:direction  1 11552.2 11552.2  2.3299  5169 0.1270

The Df column here refers to the numerator df; Denom (second-to-last) gives the estimated denominator df. They agree with the classical intuition. More importantly, we also get different answers for the F values ...

We can also double-check with Kenward-Roger (very slow because it involves refitting the model several times)

lmerTest::anova(model,ddf="Kenward-Roger")

The results are identical.

For this example lme (from the nlme package) actually does a perfectly good job guessing the appropriate denominator df (the F and p-values are very slightly different):

model3 <- lme(rt ~ group * direction, random=~1|subnum, data = ANT.2)
anova(model3)[-1,]
##                 numDF denDF   F-value p-value
## group               1    18 2.4334314  0.1362
## direction           1  5169 0.3937316  0.5304
## group:direction     1  5169 2.3298847  0.1270

If I fit an interaction between direction and subnum the df for direction and group:direction are much smaller (I would have thought they would be 18, but maybe I'm getting something wrong):

model2 <- lmer(rt ~ group * direction + (direction | subnum), data = ANT.2)
lmerTest::anova(model2)
##                 Df  Sum Sq Mean Sq F value   Denom Pr(>F)
## group            1 20334.7 20334.7  2.4302  17.995 0.1364
## direction        1  1804.3  1804.3  0.3649 124.784 0.5469
## group:direction  1 10616.6 10616.6  2.1418 124.784 0.1459

Thank you @Ben Bolker for your answer. I will think over your comments and make few more experiments. I understand the `ezAnova` warning as you should not run 2x2 anova if in fact your data are from 2x2x2 design. — Jiri Lukavsky, Feb 06 '14 at 17:27
Possibly the warning that comes with `ez` could be re-worded; it actually has two parts that are important: (1) that data is being aggregated and (2) stuff about partial designs. #1 is most pertinent to the discrepancy as it explains that in order to do a traditional non-mixed-effects anova, one must aggregate the data to a single observation per cell of the design. In this case, we want one observation per subject per level of the "direction" variable (whilst maintaining group labels for subjects). ezANOVA computes this automatically. — Mike Lawrence, May 21 '14 at 17:48
+1 but I am not sure that ezanova has it wrong. I ran `summary(aov(rt ~ group*direction + Error(subnum/direction), data=ANT.2))` and it gives 16 (?) dfs for `group` and 18 for `direction` and `group:direction`. The fact that there are ~125 observations per group/direction combination is pretty much irrelevant for RM-ANOVA, see e.g. my own question https://stats.stackexchange.com/questions/286280: direction is tested, so-to-say, against subject-direction interaction. — amoeba, Jan 05 '18 at 18:41
Ben, following up on my previous comment: is it actually what you meant with "I would have thought they would be 18, but maybe I'm getting something wrong"? If so, then we are in agreement. But again, 18 agrees with RM-ANOVA and disagrees with `lmerTest` that estimates ~125 dfs. — amoeba, Jan 05 '18 at 18:53
Update to the above: `lmerTest::anova(model2, ddf="Kenward-Roger")` returns 18.000 df for `group` and `17.987` df for the other two factors, which is in excellent agreement with RM-ANOVA (as per ezAnova). My conclusion is that Satterthwaite's approximation fails for `model2` for some reason. — amoeba, Jan 07 '18 at 14:55

score 10 · Answer 2 · answered May 04 '18 at 14:05

I generally agree with Ben's analysis but let me add a couple of remarks and a little intuition.

First, the overall results:

lmerTest results using the Satterthwaite method are correct
The Kenward-Roger method is also correct and agrees with Satterthwaite

Ben outlines the design in which subnum is nested in group while direction and group:direction are crossed with subnum. This means that the natural error term (i.e. the so-called "enclosing error stratum") for group is subnum while the enclosing error stratum for the other terms (including subnum) is the residuals.

This structure can be represented in a so-called factor-structure diagram:

names <- c(expression("[I]"[5169]^{5191}),
           expression("[subnum]"[18]^{20}), expression(grp:dir[1]^{4}),
           expression(dir[1]^{2}), expression(grp[1]^{2}), expression(0[1]^{1}))
x <- c(2, 4, 4, 6, 6, 8)
y <- c(5, 7, 5, 3, 7, 5)
plot(NA, NA, xlim=c(2, 8), ylim=c(2, 8), type="n", axes=FALSE, xlab="", ylab="")
text(x, y, names) # Add text according to the 'names' vector
# Define coordinates for start (x0, y0) and end (x1, y1) of arrows:
x0 <- c(1.8, 1.8, 4.2, 4.2, 4.2, 6, 6) + .5
y0 <- c(5, 5, 7, 5, 5, 3, 7)
x1 <- c(2.7, 2.7, 5, 5, 5, 7.2, 7.2) + .5
y1 <- c(5, 7, 7, 3, 7, 5, 5)
arrows(x0, y0, x1, y1, length=0.1)

Here random terms are enclosed in brackets, 0 represents the overall mean (or intercept), [I] represents the error term, the super-script numbers are the number of levels and the sub-script numbers are the number of degrees of freedom assuming a balanced design. The diagram indicates that the natural error term (enclosing error stratum) for group is subnum and that the numerator df for subnum, which equals the denominator df for group, is 18: 20 minus 1 df for group and 1 df for the overall mean. A more comprehensive introduction to factor structure diagrams is available in chapter 2 here: https://02429.compute.dtu.dk/eBook.
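The df bookkeeping in the diagram can be reproduced with a few lines of arithmetic. This is just a sketch for the balanced case; the variable names are mine and the level counts are the superscripts from the diagram.

```r
# Level counts read off the factor-structure diagram
n_obs <- 5191        # observations, [I]
n_subnum <- 20       # subjects
n_group <- 2
n_direction <- 2

df_mean <- 1
df_group <- n_group - 1
df_direction <- n_direction - 1
df_grp_dir <- (n_group - 1) * (n_direction - 1)

# subnum is nested in group: 20 levels minus 1 df for group and 1 for the mean
df_subnum <- n_subnum - df_group - df_mean

# The residual stratum [I] receives whatever is left over
df_resid <- n_obs - (df_mean + df_group + df_direction + df_grp_dir + df_subnum)

c(subnum = df_subnum, residual = df_resid)
```

These are the two denominator df (18 and 5169) that appear in the Satterthwaite table for the random-intercept model.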

If the data were exactly balanced we would be able to construct the F-tests from a SSQ-decomposition as provided by anova.lm. Since the dataset is very closely balanced we can obtain approximate F-tests as follows:

ANT.2 <- subset(ANT, !error)
set.seed(101)
baseline.shift <- rnorm(length(unique(ANT.2$subnum)), 0, 50)
ANT.2$rt <- ANT.2$rt + baseline.shift[as.numeric(ANT.2$subnum)]
fm <- lm(rt ~ group * direction + subnum, data=ANT.2)
(an <- anova(fm))
Analysis of Variance Table

Response: rt
                  Df   Sum Sq Mean Sq  F value Pr(>F)    
group              1   994365  994365 200.5461 <2e-16 ***
direction          1     1568    1568   0.3163 0.5739    
subnum            18  7576606  420923  84.8927 <2e-16 ***
group:direction    1    11561   11561   2.3316 0.1268    
Residuals       5169 25629383    4958                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Here all F and p values are computed assuming that all terms have the residuals as their enclosing error stratum, and that is true for all but 'group'. The 'balanced-correct' F-test for group is instead:

F_group <- an["group", "Mean Sq"] / an["subnum", "Mean Sq"]
c(Fvalue=F_group, pvalue=pf(F_group, 1, 18, lower.tail = FALSE))
   Fvalue    pvalue 
2.3623466 0.1416875

where we use the subnum MS instead of the Residuals MS in the F-value denominator.
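To see why subnum is the right denominator, here is a sketch on exactly balanced simulated data (made-up values, not the ANT data). With exact balance the 'balanced-correct' F for group coincides with a plain one-way ANOVA on the subject means, i.e. group is effectively tested on 20 numbers, one per subject.

```r
# Exactly balanced simulated design: 20 subjects, 2 groups, 2 directions,
# 10 trials per subject and direction (all values are made up)
set.seed(2)
n_subj <- 20
n_trials <- 10
d <- expand.grid(trial = seq_len(n_trials),
                 direction = factor(c("left", "right")),
                 subnum = factor(seq_len(n_subj)))
d$group <- factor(ifelse(as.integer(d$subnum) <= n_subj / 2,
                         "Control", "Treatment"))
d$rt <- 500 + rnorm(n_subj, 0, 50)[as.integer(d$subnum)] + rnorm(nrow(d), 0, 70)

# SSQ decomposition with subnum as a fixed term, as in the answer
an <- anova(lm(rt ~ group * direction + subnum, data = d))

# 'Balanced-correct' F-test: group MS against subnum MS
df_subnum <- an["subnum", "Df"]
F_group <- an["group", "Mean Sq"] / an["subnum", "Mean Sq"]
p_group <- pf(F_group, 1, df_subnum, lower.tail = FALSE)

# The same test from a one-way ANOVA on one mean per subject
subj <- aggregate(rt ~ subnum + group, data = d, FUN = mean)
F_means <- anova(lm(rt ~ group, data = subj))["group", "F value"]

c(F_group = F_group, F_means = F_means, df_subnum = df_subnum)
```

The two F values agree up to floating-point error only because the simulated data are exactly balanced; with the slightly unbalanced ANT data the agreement is approximate.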

Note that these values match quite well with the Satterthwaite results:

model <- lmer(rt ~ group * direction + (1 | subnum), data = ANT.2)
anova(model, type=1)
Type I Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
group           12065.3 12065.3     1    18  2.4334 0.1362
direction        1951.8  1951.8     1  5169  0.3936 0.5304
group:direction 11552.2 11552.2     1  5169  2.3299 0.1270

Remaining differences are due to the data not being exactly balanced.

The OP compares anova.lm with anova.lmerModLmerTest, which is ok, but to compare like with like we have to use the same contrasts. In this case there is a difference between anova.lm and anova.lmerModLmerTest since they produce Type I and III tests by default respectively, and for this dataset there is a (small) difference between the Type I and III contrasts:

show_tests(anova(model, type=1))$group
               (Intercept) groupTreatment directionright groupTreatment:directionright
groupTreatment           0              1    0.005202759                     0.5013477

show_tests(anova(model, type=3))$group # type=3 is default
               (Intercept) groupTreatment directionright groupTreatment:directionright
groupTreatment           0              1              0                           0.5

If the data set had been completely balanced the type I contrasts would have been the same as the type III contrasts (which are not affected by the observed number of samples).

One last remark is that the 'slowness' of the Kenward-Roger method is not due to model re-fitting, but because it involves computations with the marginal variance-covariance matrix of the observations/residuals (5191x5191 in this case) which is not the case for Satterthwaite's method.
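To give a feeling for the object involved, here is a toy sketch of the marginal variance-covariance matrix for a random-intercept model. The layout (3 subjects, 4 observations each) and the standard deviations are assumptions chosen for illustration, and the code only shows the structure and size of the matrix; it is not how the Kenward-Roger implementation actually carries out its computations.

```r
# Toy layout: 3 subjects with 4 observations each (the real data have n = 5191)
subnum <- factor(rep(1:3, each = 4))
n <- length(subnum)

# Random-intercept design matrix Z: one indicator column per subject
Z <- model.matrix(~ 0 + subnum)

# Illustrative standard deviations (assumed values, roughly the size
# of the estimates reported for these data)
sd_subnum <- 40.1
sd_resid <- 70.4

# Marginal covariance of the observations: V = sd_subnum^2 * Z Z' + sd_resid^2 * I
V <- sd_subnum^2 * tcrossprod(Z) + sd_resid^2 * diag(n)
dim(V)          # n x n, block-diagonal with one block per subject

# Storage for a dense n x n double matrix with n = 5191, in MiB
5191^2 * 8 / 1024^2
```

Observations from the same subject share the covariance sd_subnum^2 and observations from different subjects are uncorrelated, so V is block-diagonal, but a dense version still grows with the square of the number of observations.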

Concerning model2

As for model2 the situation becomes more complex and I think it is easier to start the discussion with another model where I have included the 'classical' interaction between subnum and direction:

model3 <- lmer(rt ~ group * direction + (1 | subnum) +
                 (1 | subnum:direction), data = ANT.2)
VarCorr(model3)
 Groups           Name        Std.Dev.  
 subnum:direction (Intercept) 1.7008e-06
 subnum           (Intercept) 4.0100e+01
 Residual                     7.0415e+01

Because the variance associated with the interaction is essentially zero (in the presence of the subnum random main-effect) the interaction term has no effect on the calculation of denominator degrees of freedom, F-values and p-values:

anova(model3, type=1)
Type I Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
group           12065.3 12065.3     1    18  2.4334 0.1362
direction        1951.8  1951.8     1  5169  0.3936 0.5304
group:direction 11552.2 11552.2     1  5169  2.3299 0.1270

However, subnum:direction is the enclosing error stratum for subnum, so if we remove subnum all the associated SSQ falls back into subnum:direction.

model4 <- lmer(rt ~ group * direction +
                 (1 | subnum:direction), data = ANT.2)

Now the natural error term for group, direction and group:direction is subnum:direction and with nlevels(with(ANT.2, subnum:direction)) = 40 and four parameters the denominator degrees of freedom for those terms should be about 36:

anova(model4, type=1)
Type I Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF  DenDF F value  Pr(>F)  
group           24004.5 24004.5     1 35.994  4.8325 0.03444 *
direction          50.6    50.6     1 35.994  0.0102 0.92020  
group:direction   273.4   273.4     1 35.994  0.0551 0.81583  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
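The figure of about 36 quoted above is simply the number of subnum:direction cells minus the number of fixed-effect parameters. A sketch on balanced simulated data (made-up values, not the ANT data) confirms the count with a fixed-effects decomposition:

```r
# Balanced simulated design (values are made up)
set.seed(3)
n_subj <- 20
n_trials <- 10
d <- expand.grid(trial = seq_len(n_trials),
                 direction = factor(c("left", "right")),
                 subnum = factor(seq_len(n_subj)))
d$group <- factor(ifelse(as.integer(d$subnum) <= n_subj / 2,
                         "Control", "Treatment"))
d$rt <- 500 + rnorm(n_subj, 0, 50)[as.integer(d$subnum)] + rnorm(nrow(d), 0, 70)

# Number of subject-by-direction cells and of fixed-effect parameters
n_cells <- nlevels(interaction(d$subnum, d$direction, drop = TRUE))
n_fixed <- 4   # intercept, group, direction, group:direction
n_cells - n_fixed

# The fourth row of the table is the subnum-by-direction term
an <- anova(lm(rt ~ group * direction + subnum:direction, data = d))
df_cells <- an[4, "Df"]
df_cells
```

Once the three fixed-effect terms and the intercept are accounted for, the 40 cell means have 36 df left, and that is the stratum against which all three terms are tested in model4.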

These F-tests can also be approximated with the 'balanced-correct' F-tests:

an4 <- anova(lm(rt ~ group*direction + subnum:direction, data=ANT.2))
an4[1:3, "F value"] <- an4[1:3, "Mean Sq"] / an4[4, "Mean Sq"]
an4[1:3, "Pr(>F)"] <- pf(an4[1:3, "F value"], 1, 36, lower.tail = FALSE)
an4
Analysis of Variance Table

Response: rt
                   Df   Sum Sq Mean Sq F value Pr(>F)    
group               1   994365  994365  4.6976 0.0369 *  
direction           1     1568    1568  0.0074 0.9319    
group:direction     1    10795   10795  0.0510 0.8226    
direction:subnum   36  7620271  211674 42.6137 <2e-16 ***
Residuals        5151 25586484    4967                   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Now turning to model2:

model2 <- lmer(rt ~ group * direction + (direction | subnum), data = ANT.2)

This model describes a rather complicated random-effect covariance structure with a 2x2 variance-covariance matrix. The default parameterization is not easy to deal with and we are better off with a re-parameterization of the model:

model2 <- lmer(rt ~ group * direction + (0 + direction | subnum), data = ANT.2)

If we compare model2 to model4, they have equally many random-effects; 2 for each subnum, i.e. 2*20=40 in total. While model4 stipulates a single variance parameter for all 40 random effects, model2 stipulates that each subnum-pair of random effects has a bi-variate normal distribution with a 2x2 variance-covariance matrix the parameters of which are given by

VarCorr(model2)
 Groups   Name           Std.Dev. Corr 
 subnum   directionleft  38.880        
          directionright 41.324   1.000
 Residual                70.405

This indicates over-fitting, but let's save that for another day. The important point here is that model4 is a special case of model2 and that the random-intercept fit (the object called model above) is also a special case of model2. Loosely (and intuitively) speaking, (direction | subnum) contains or captures the variation associated with the main effect subnum as well as the interaction direction:subnum. In terms of the random effects we can think of these two effects or structures as capturing variation between rows and rows-by-columns respectively:

head(ranef(model2)$subnum)
  directionleft directionright
1    -25.453576     -27.053697
2     16.446105      17.479977
3    -47.828568     -50.835277
4     -1.980433      -2.104932
5      5.647213       6.002221
6     41.493591      44.102056
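A quick way to read this table is to check how close the two columns are to being proportional. The sketch below just re-enters the six printed rows and the two standard deviations from the VarCorr output by hand (so it is limited by the printed rounding); nothing here is recomputed from the fitted model.

```r
# First six random-effect estimates, copied from the printed output
re <- data.frame(
  directionleft  = c(-25.453576, 16.446105, -47.828568,
                     -1.980433, 5.647213, 41.493591),
  directionright = c(-27.053697, 17.479977, -50.835277,
                     -2.104932, 6.002221, 44.102056)
)

# Row-wise ratio right / left
ratio <- re$directionright / re$directionleft
round(ratio, 4)

# Ratio of the two estimated standard deviations from VarCorr(model2)
sd_ratio <- 41.324 / 38.880
round(sd_ratio, 4)

# Correlation between the two columns
r <- cor(re$directionleft, re$directionright)
r
```

Every subject's right-hand effect is the same multiple of its left-hand effect, which is what an estimated correlation of 1.000 means in practice: the two columns carry a single subject effect, not two.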

In this case these random effect estimates as well as the variance parameter estimates both indicate that we really only have a random main effect of subnum (variation between rows) present here. What this all leads up to is that Satterthwaite denominator degrees of freedom in

anova(model2, type=1)
Type I Analysis of Variance Table with Satterthwaite's method
                 Sum Sq Mean Sq NumDF   DenDF F value Pr(>F)
group           12059.8 12059.8     1  17.998  2.4329 0.1362
direction        1803.6  1803.6     1 125.135  0.3638 0.5475
group:direction 10616.6 10616.6     1 125.136  2.1418 0.1458

is a compromise between these main-effect and interaction structures: The group DenDF remains at 18 (nested in subnum by design) but the direction and group:direction DenDF are compromises between 36 (model4) and 5169 (model).

I don't think anything here indicates that the Satterthwaite approximation (or its implementation in lmerTest) is faulty.

The equivalent table with the Kenward-Roger method gives

anova(model2, type=1, ddf="Ken")
Type I Analysis of Variance Table with Kenward-Roger's method
                 Sum Sq Mean Sq NumDF  DenDF F value Pr(>F)
group           12059.8 12059.8     1 18.000  2.4329 0.1362
direction        1803.2  1803.2     1 17.987  0.3638 0.5539
group:direction 10614.7 10614.7     1 17.987  2.1414 0.1606

It is not surprising that KR and Satterthwaite can differ, but for all practical purposes the difference in p-values is minute. My analysis above indicates that the DenDF for direction and group:direction should not be smaller than ~36, and probably larger than that given that we basically only have the random main effect of subnum present, so if anything I think this is an indication that the KR method gets the DenDF too low in this case. But keep in mind that the data don't really support the (direction | subnum) structure, so the comparison is a little artificial; it would be more interesting if the model were actually supported.

+6, thanks, very interesting! A couple of questions. (1) Where can I read more about "enclosing error stratum"? I googled this term and this answer was the *only* hit. More generally, what literature would you recommend to learn about these issues? (2a) As far I understand, classical RM-ANOVA for this design corresponds to your `model3`. However, it uses `subnum:direction` as the error term for testing `direction`. Whereas here you can force this to happen only by excluding `(1|subnum)` as in `model4`. Why? (2b) Also, RM-ANOVA yields df=18 for `direction`, not 36 as you get in `model4`. Why? — amoeba, May 05 '18 at 13:44
For my points (2a+2b), see `summary(aov(rt ~ group*direction + Error(subnum/direction), data=ANT.2))`. — amoeba, May 05 '18 at 13:47
(1) The topic of error strata and which terms are enclosed in which strata are derived from the Expected Mean Square expressions for a given model/design. This is "standard" Design of Experiments (DoE) material though these more technical topics are often dropped in easy-going ("applied") variants of such courses. See for example ch 11&12 in http://users.stat.umn.edu/~gary/book/fcdae.pdf for an introduction. I learned the topic from D C Montgomery's equivalent text and the extensive extra materials from the (recently and regrettably) late Professor Henrik Spliid. — Rune H Christensen, May 05 '18 at 18:26
... For a more thorough treatment Variance Components (1992 and 2006) by Searle et al is a classic. — Rune H Christensen, May 05 '18 at 18:28
Ahh, yes, I should have seen that: if we have a model in which both `subnum` and `subnum:direction` are non-zero then `anova(lm(rt2 ~ group * direction + subnum + subnum:direction, data = ANT.2))` gives 18 df for all three factors and this is what the KR-method picks up. This can be seen already with `model3` where KR gives the design-based 18 df for all terms _even_ when the interaction variance is zero while Satterthwaite recognizes the vanishing variance term and adjusts the df accordingly.... — Rune H Christensen, May 05 '18 at 19:44
... If the interaction variance is non-zero both methods give 18 df for all three terms - the same is true for model2. The difference in how zero-variance terms are handled may be inherent method differences or 'just' implementation differences. — Rune H Christensen, May 05 '18 at 19:45
RM-ANOVA is in my experience one of the more ambiguous terms in statistics. In this design we do not (I assume) have the restriction on randomization that is present with the 'time' variable in RM-ANOVAs. Sometimes the F-tests are adjusted for such randomization restrictions as is also done in split-plot designs. I am not familiar with how `aov` constructs its tests but it may (1) assume a different design, or (2) adjust its tests under the assumption of restrictions on the randomization. I would not be surprised if the `Error()` capabilities in `aov` were made for split-plot designs. — Rune H Christensen, May 05 '18 at 20:05
Hmm. I don't quite know what you mean by "the restriction on randomization", but I was under impression that a design with one between-subject factor (`group`) and one within-subject factor (`direction`) *is* exactly the same thing as split-plot design. In fact, I've definitely seen this explicitly stated in different places on our forum and elsewhere online. Is it not the case?! — amoeba, May 05 '18 at 20:41
No, a defining feature about split-plot designs is different levels of randomization or the restrictions on the randomization scheme. It is not enough to look at the data structure; you have to know how the randomization was done. See chapter 16 in the book I linked to above. — Rune H Christensen, May 05 '18 at 21:01
Hmm. I read the section 16.1. Now I understand what the "restriction on randomization" refers to. However, I still don't see why this design is not a split-plot. Subjects are randomly assigned to the two groups (subjects are whole plots). And each subject experiences two directions, presumably in random order. So here are two randomizations, exactly as in the split-plot; and there is the same "restriction". It still seems to me that it's a split-plot. Perhaps you meant that if the time order of the within-subject factor levels is not randomized *then* it's not a split-plot; that's a good point. — amoeba, May 05 '18 at 21:31
I have awarded the bounty to this answer now, but I would still be very interested in clarifying this issue about split-plot. — amoeba, May 11 '18 at 14:27
