I'm running a multilevel logistic regression in which I estimate the probability the authors of scientific papers will make a particular sort of error in reporting a null hypothesis significance test (specifically, an error whereby the degrees of freedom, test statistic, and $p$-value are inconsistent with each other).
The outcome is a binary variable error indicating whether the statistic was reported correctly. The predictors are:
- the type of statistic ($t$-, $F$-, $r$-, or $\chi^2$),
- the year in which the paper was published, and
- a binary categorisation indicating whether the NHST was focal to the paper's aims or was instead peripheral (e.g. a manipulation check, or a test of some assumption).
The data has a nested structure whereby each significance test is nested in a paper, and each paper is nested in a journal.
If I ignore the article level and nest the significance tests directly within journals, the 2-level model converges and I get results that look rather sensible.
library(lme4)

dat <- read.csv("error_data.csv")

# Two-level model: significance tests nested in journals (article level omitted)
m1 <- glmer(error ~ 1 + year + categorisation + statistic + (1 | journalID),
            data = dat,
            family = binomial(link = "logit"))
summary(m1)
I've been advised that the nesting of articles in journals should be taken into account, and this makes sense to me. However, when I run the 3-level model with that nesting included, I get a convergence warning.
# Three-level model: tests nested in articles, articles nested in journals
m2 <- glmer(error ~ 1 + year + categorisation + statistic + (1 | journalID/articleID),
            data = dat,
            family = binomial(link = "logit"))
summary(m2)
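(For what it's worth, my understanding is that the nested syntax is just shorthand for two explicit intercept terms, so the same model could be written as below; m2_explicit is only an illustrative label.)

# (1 | journalID/articleID) expands to (1 | journalID) + (1 | journalID:articleID)
m2_explicit <- glmer(error ~ 1 + year + categorisation + statistic +
                       (1 | journalID) + (1 | journalID:articleID),
                     data = dat,
                     family = binomial(link = "logit"))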
The full warning from fitting m2 is:
Warning message:
In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 4.25558 (tol = 0.002, component 1)
While calling summary(m2) gives the somewhat reassuring line "optimizer (Nelder_Mead) convergence code: 0 (OK)", there still seems to be a problem: the parameter estimates change wildly between m1 and m2, and to me they no longer make much sense.
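To make the comparison concrete, I'm looking at the fixed effects side by side, e.g.:

# Fixed-effect estimates (log-odds scale) from the two models
round(cbind(m1 = fixef(m1), m2 = fixef(m2)), 3)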
Why is this happening, and what should I do? So that anyone reading this can reproduce the issue, I have uploaded the CSV file here. My guess is that it relates to the fact that most articles contribute only 1 or 2 significance tests to the sample. I'm reluctant to adopt an alternative approach (e.g. a Bayesian solution), both because it would make things more complicated and because I already preregistered the current approach.
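A quick way to check that guess (assuming, as in the models above, that articleID uniquely identifies an article):

# Distribution of the number of significance tests per article
table(table(dat$articleID))
# Distribution of the number of articles per journal
table(tapply(dat$articleID, dat$journalID, function(x) length(unique(x))))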
Update: with the help of some links @KrisBae provided, I've tried a few things, so far to no avail. I list below what I've tried and the messages that resulted, in case they point in the direction of a solution.
- Rescaling and/or centering the data doesn't fix the problem
- I checked and neither of the random intercepts has near-zero variance
- The fit of m2 isn't singular or near-singular
- The gradient calculations come out somewhat different when computed with the numDeriv package, but I'm not sure what, if anything, this means (this and the attempts below are sketched in code after this list)
- Restarting the fit did not fix the problem
- Increasing the number of iterations just produced the error message "PIRLS step-halvings failed to reduce deviance in pwrssUpdate"
- Changing the optimizer to bobyqa for both phases resulted in another convergence failure and the message "Model is nearly unidentifiable: very large eigenvalue"
- Setting nAGQ=0 does cause the model to converge. According to the documentation, nAGQ is "the number of points per axis for evaluating the adaptive Gauss-Hermite approximation to the log-likelihood". Setting it to 0 isn't recommended by any of the documents and is just an idea that occurred to me. From other posts on this site I see that setting nAGQ=0 is not generally encouraged, and there doesn't yet seem to be an answer to the question of when it is OK. So in addition to wanting to know why my model doesn't converge, I am also curious why setting nAGQ to 0 does make it converge.
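For reference, the numDeriv check and the restart/optimizer/nAGQ attempts looked roughly like the following. This is only a sketch based on the troubleshooting steps in ?lme4::convergence; the maxfun value and object names are illustrative rather than exactly what I ran.

# Recompute the gradient and Hessian by Richardson extrapolation (numDeriv)
devfun <- update(m2, devFunOnly = TRUE)
pars <- unlist(getME(m2, c("theta", "fixef")))
hess <- numDeriv::hessian(devfun, pars)
grad <- numDeriv::grad(devfun, pars)

# Restart the fit from the previous parameter estimates
m2_restart <- update(m2, start = getME(m2, c("theta", "fixef")))

# Increase the number of optimizer iterations
m2_iter <- update(m2, control = glmerControl(optCtrl = list(maxfun = 2e5)))

# Use bobyqa for both optimization phases
m2_bobyqa <- update(m2, control = glmerControl(optimizer = "bobyqa"))

# nAGQ = 0: the fixed effects are estimated within the penalized IRLS step
# rather than in the nonlinear optimization (faster, but a cruder approximation)
m2_nAGQ0 <- update(m2, nAGQ = 0)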
As an experiment I tried randomly sampling 1500 of my 2086 cases and running the model with nAGQ = 1. Over 30 repetitions I got 22 singular fits (where the parameter estimates always seemed crazy), 5 "Model failed to converge" warnings (where the estimates sometimes seemed sensible and sometimes not), and 3 fits that converged without warnings (where the estimates always seemed reasonably sensible, though they varied a bit). Broadly similar results ensued when I instead sampled 1000 of the 2086 cases.
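For concreteness, the subsampling experiment went roughly like this (a sketch; the seed and the way the outcomes are tallied at the end are illustrative):

set.seed(123)  # illustrative seed
refits <- lapply(1:30, function(i) {
  sub <- dat[sample(nrow(dat), 1500), ]
  suppressWarnings(
    glmer(error ~ 1 + year + categorisation + statistic + (1 | journalID/articleID),
          data = sub, family = binomial(link = "logit"), nAGQ = 1)
  )
})

# Tally singular fits, convergence warnings, and clean fits
singular <- sapply(refits, isSingular)
n_msgs <- sapply(refits, function(m) length(m@optinfo$conv$lme4$messages))
table(ifelse(singular, "singular",
             ifelse(n_msgs > 0, "failed to converge", "clean")))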