
I currently have some data but am unsure how I can model it using a mixed model. My data/variables are as follows:

Test (factor) - A collection of 5 tests that every person has taken. That is, each person sat down for an hour and took 5 different tests, and I have a separate score for each test they took. So things like age will be constant across a person's 5 scores.

Value - The scores on the tests

Status(factor) - Whether the person smokes or doesn't smoke

Age - 15-70

Gender(factor) - M/F

I am wondering how I can use a mixed model here to help me determine how test performance is associated with age, gender, and status. This is what I have, but I do not think it is correct. If you have any ideas, they would be greatly appreciated!

lmer(value ~ status + age + gender + (status + age + gender | test), data = df)
Robert Long
halo2e
2 Answers


You are correct that a mixed model can be used here.

You are also correct to specify random intercepts for test, although some would argue that 5 levels is insufficient to model it as random. There is no clear answer to this, but it would appear that the tests can be thought of as coming from some wider population of tests, so it is appropriate to fit such a model in the first instance. The reason for specifying random intercepts here is that the results within each test are likely to be more similar to each other than to results from other tests. That is, the results within each test will be correlated. A mixed model with random intercepts is one way to control for this. The alternative is to include test as a fixed effect, and you might want to fit different models, treating test as fixed or random, and compare the results.
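To compare the two approaches, you could fit test both ways and look at whether the fixed-effect estimates change. A minimal sketch, using the variable names from the question (`value`, `status`, `age`, `gender`, `test`, data frame `df`) and ignoring for brevity the subject-level clustering discussed below:

```r
library(lme4)

# test as a random factor: a random intercept for each test
m_random <- lmer(value ~ status + age + gender + (1 | test), data = df)

# test as a fixed factor: one dummy-coded coefficient per test
m_fixed <- lm(value ~ status + age + gender + test, data = df)

summary(m_random)
summary(m_fixed)
```

If the estimates for status, age and gender are stable across the two specifications, the fixed-versus-random choice for test matters little for your substantive conclusions.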

However, you also have repeated measurements within subjects, so you need to allow for this too, for the same reasons as for test.

You have also specified random slopes for status, age and gender, as well as fixed effects for these. What this means is that you want to estimate a fixed effect, but you also want this effect to vary across tests, so the software will additionally estimate a variance for each of these random effects. You should ask yourself whether this actually makes sense: do you want to estimate these variances, or just an overall (fixed) effect? Also, note that while it is fine to specify random slopes for these variables over levels of test (subject to the foregoing caveats), it is not at all fine to specify random slopes for them over levels of subject, as noted in the answer by @ErikRuzek, because these variables do not vary within subjects. Attempting to do so should produce an error, or a warning of a singular fit.
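If you do want to retain a random slope over tests, it is worth fitting both versions and checking for a singular fit. A sketch, assuming `df` contains a `subject` identifier column:

```r
library(lme4)

# random intercepts only
m_int <- lmer(value ~ status + age + gender + (1 | test) + (1 | subject),
              data = df)

# add a random slope for age over tests
m_slope <- lmer(value ~ status + age + gender + (1 + age | test) + (1 | subject),
                data = df)

isSingular(m_slope)                   # TRUE suggests the random structure is too complex
anova(m_int, m_slope, refit = FALSE)  # compare the REML fits directly
```

`refit = FALSE` is used because the two models differ only in their random effects, so the REML fits can be compared directly.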

Additionally, most software will assume that random effects follow a multivariate normal distribution and will also estimate the correlations between them. Depending on how many subjects are in your sample, the data may not support such a complex random structure.

I would therefore suggest the following model:

lmer(value ~ status + age + gender + (1 | test) + (1 | subject), data = df)

You might also want to allow for a non-linear effect of age, by fitting higher-order terms or splines, and you might also want to allow for interactions between age, status, and gender.
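For example, a natural cubic spline for age together with an age-by-status interaction could be specified as follows. This is a sketch: the 3 degrees of freedom for the spline are an arbitrary choice, and `subject` is an assumed column name.

```r
library(lme4)
library(splines)

# ns() fits a natural cubic spline for age; the * expands to main effects
# plus the spline-by-status interaction
m_spline <- lmer(value ~ ns(age, df = 3) * status + gender +
                   (1 | test) + (1 | subject), data = df)
```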

Finally, note that the fixed effects in a mixed model are generally conditional on the random effects, so the estimates are for the same test, not averaged over all tests. (For a linear mixed model with an identity link the conditional and marginal estimates coincide, but they differ for generalised linear mixed models with a non-linear link.) You might want the average (marginal) effect instead, and if so you need to choose software which can compute it (eg the GLMMadaptive package in R).

Robert Long
    Interesting to read your thinking on treating test as a random effect @RobertLong. In my answer I thought with just 5 levels, it was a bit of stretch. But I agree that comparing models and thinking about whether the assumption of normality of the random effect makes sense is important for the analyst. Also nice to point out the marginal effect possibility from `GLMMadaptive`. +1 – Erik Ruzek Jan 31 '20 at 16:27
  • @ErikRuzek indeed, that's why I said that some might not agree ;) I am persuaded to treat a low number of levels as random when it seems obvious that they can be thought of as coming from a bigger population, and the other thing is that Doug Bates, primary author of `lme4` uses as few as 4 in his [book](http://webcom.upmf-grenoble.fr/LIP/Perso/DMuller/M2R/R_et_Mixed/documents/Bates-book.pdf). So my absolute minimum is 4 :D – Robert Long Jan 31 '20 at 16:41
  • I feel that a step might be missing here, namely are the tests a sample from a population of potential tests or are they the entire population of tests of interest? This point wasn't clear in the question for me. An answer in favour of the second option here would make any question around the number of levels required for random factors moot as test would then be a fixed factor. Personally, I wouldn't recommend looking at this with test as both a random and as a fixed factor as these start from different views of the research question. – user215517 Aug 11 '21 at 21:14
  • @user215517 the problem I see is that with a small number of levels of the grouping factor, it can be problematic to fit a mixed model where the random effects are assumed to be normally distributed. There are times when it's absolutely fine, and others when it's a gross mistake. Fitting separate models helps ensure that inferences are correct. The rule of thumb about a factor being sampled from a wider population is just that, a rule of thumb, and there are many competing considerations when deciding to model a factor as random or fixed - and they often conflict with each other. – Robert Long Aug 11 '21 at 21:22
  • @RobertLong The question of whether "test" should be a fixed or a random effect here starts with the question of whether this is the entire population of tests of interest (which is plausible given the information above, c.f. comparing two countries) or whether this is a sample of possible tests (which is also plausible, c.f. a sample of countries). This isn't a heuristic here any more than it would be when asking whether ethnicity or country, etc. would a random sample of levels or not. The question of whether there are sufficient levels is only relevant under the second interpretation. – user215517 Aug 11 '21 at 22:39
  • @RobertLong I'd also worry that fitting separate models inflates Type I error rates (c.f. normality testing for deciding between parametric and non-parametric tests). We might want a variable to be random rather than fixed but feel that with only two or three levels this presents issues, but even there I wouldn't look at both as equally valid choices. – user215517 Aug 11 '21 at 22:41
  • @user215517 of course, I would always adjust p-values for multiple testing. As for random vs fixed, there is an enormous literature on this. As I mentioned, there are competing considerations. Take planned experiments with blocking, for example. It is normal to fit random intercepts for blocks (provided there are enough) even though the blocks can't be thought of as samples. See the work of Doug Bates (primary author of `lme4` for R and probably the most knowledgeable person on the planet as far as mixed models are concerned) on fitting mixed models to experimental data, for example. /cont – Robert Long Aug 11 '21 at 22:57
  • Another example is the use of mixed models for health data where every hospital in the whole country is included. Such modelling is completely normal - we wouldn't want a model with 2000 (in the UK) fixed effects for hospitals. Also, I would never advocate fitting random intercepts with only 2 or 3 levels of the grouping factor. – Robert Long Aug 11 '21 at 23:00
  • [This is a very interesting Q&A](https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode/151800#151800). The accepted answer has a quote from Andrew Gelman detailing some of the competing consideration I mentioned above. – Robert Long Aug 11 '21 at 23:07

Status, age, and gender are characteristics of subjects, so you cannot give them random slopes over subjects in the mixed model; only variables that vary within subjects can have random slopes in this framework. All individuals took the same set of tests, and I would likely treat test as a fixed factor rather than a random factor. What does not appear in your current model, but is critical, is the subject ID variable. This is the level-2 unit that you are repeatedly measuring. Thus I would suggest you model this more like the following:

m1 <- lmer(value ~ status + age + gender + test + (1|SubjectID), data = df)

This will give you estimates of the outcome mean difference between smokers and non-smokers (status), the difference in the outcome for individuals who differ by 1 year of age (age), the male-female difference in the outcome, and then the average score for each of the tests. In addition, you will get two random effects - the intercept for SubjectID, telling you how Subjects vary around the grand mean (given by the intercept in the fixed portion of the model) and residual, telling you how much individuals vary around their personal mean. These represent between- and within-subject variance, respectively.
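The two variance components can be extracted with `VarCorr()`, and their ratio gives the intraclass correlation, i.e. the share of total variance that lies between subjects. A sketch using the model above:

```r
library(lme4)

m1 <- lmer(value ~ status + age + gender + test + (1 | SubjectID), data = df)

vc <- as.data.frame(VarCorr(m1))
vc                                # between-subject and residual variances
icc <- vc$vcov[1] / sum(vc$vcov)  # intraclass correlation
```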

Erik Ruzek