
Is it possible to overfit a model by virtue of having too many categorical variables?

I have 3 categorical variables and my dependent measure is continuous (or a ratio I guess, I'm measuring accuracy/error rates).

library(lme4)

# Random intercept per subject
mod4 <- lmer(accuracy ~ group + session + trialtype +
                 session:trialtype + (1 | subject),
             REML = FALSE, data = data)

# Random intercept plus random trialtype effects per subject
mod5 <- lmer(accuracy ~ group + session + trialtype +
                 session:trialtype + (trialtype | subject),
             REML = FALSE, data = data)

With mod4 I don't get any warning or error, but with mod5 I get the boundary (singular) fit message. I've included trialtype as a random slope to replicate previous data.

When I compare the two models with anova(), mod4 has the lower AIC.

Edit:

I have 3 groups, with 30 subjects in each group. Each subject does a pre- and post-test task. In this task, there are 4 trial types, so I have 8 observations (4 trial types × 2 sessions) per subject. The data are complete, with nothing missing. Also, accuracy is measured on a 0-1 scale, and no subject has an accuracy of 0. So they're ratios.
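For reference, the design described above can be laid out as a long-format data frame. This is only a sketch of the structure: the level names (`g1`, `t1`, `pre`, `post`) and the object name `design` are placeholders, not the real labels in my data.

```r
# Hypothetical layout of the design: 3 groups x 30 subjects,
# 2 sessions x 4 trial types per subject (level names are made up).
design <- expand.grid(
  trialtype = factor(paste0("t", 1:4)),
  session   = factor(c("pre", "post"), levels = c("pre", "post")),
  subject   = factor(1:90)
)

# Group is a between-subject factor: subjects 1-30, 31-60, 61-90
group_of_subject <- factor(rep(c("g1", "g2", "g3"), each = 30))
design$group <- group_of_subject[as.integer(design$subject)]

nrow(design)                    # 90 subjects x 8 observations = 720 rows
table(table(design$subject))    # every subject should contribute 8 rows
xtabs(~ group + session, data = design)
```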

Dimitris Rizopoulos
CogNeuro123
    Please provide information on the numbers of groups, sessions, trial types, and subjects, and whether there are complete data for all the combinations. Also, if you have accuracy rates expressed as a ratio of correct results to total attempts, each individually an integer, you might want to consider a generalized linear model of a type that handles counts. For example, you could use a Poisson or related model for the numbers of correct results, with the total attempts as an offset. More details about your design and what you are trying to accomplish would help. – EdM Aug 15 '19 at 20:39
  • @EdM thank you, I've updated! – CogNeuro123 Aug 16 '19 at 04:26

1 Answer


When you include the term (trialtype | subject) in your model, and given that trialtype is a categorical variable with four levels, you include four correlated random effects per subject (an intercept plus three trial-type contrasts under the default treatment coding), i.e., you postulate an unstructured 4 × 4 covariance matrix for your random effects. That matrix has 10 parameters (4 variances and 6 covariances), so this is a complex model, and with only 8 observations per subject it's no surprise that you get a singular fit.
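To see how many covariance parameters the (trialtype | subject) term adds, and to check for a singular fit directly, something along the following lines may help. It is a rough sketch on simulated data that only mimics the design described in the question (the data frame `sim`, the level names, and all effect sizes are made up), so the numbers it prints say nothing about the real data.

```r
library(lme4)

set.seed(1)  # simulated data only; NOT the asker's data

# Mimic the design: 90 subjects (3 groups of 30), 2 sessions x 4 trial types
sim <- expand.grid(
  trialtype = factor(paste0("t", 1:4)),
  session   = factor(c("pre", "post"), levels = c("pre", "post")),
  subject   = factor(1:90)
)
sim$group <- factor(rep(c("g1", "g2", "g3"), each = 30))[as.integer(sim$subject)]

# Assumed data-generating process: random intercept only, on the logit scale
b0  <- rnorm(90, sd = 0.5)
eta <- 1 + b0[as.integer(sim$subject)] +
  0.2 * (sim$session == "post") + rnorm(nrow(sim), sd = 0.5)
sim$accuracy <- plogis(eta)

# Same structure as mod5 in the question (may warn about a singular fit)
mod5 <- lmer(accuracy ~ group + session + trialtype +
                 session:trialtype + (trialtype | subject),
             REML = FALSE, data = sim)

# Number of variance/covariance parameters for q correlated random effects
q     <- nlevels(sim$trialtype)   # intercept + 3 contrasts = 4
n_cov <- q * (q + 1) / 2          # 4 variances + 6 covariances

n_cov
length(getME(mod5, "theta"))  # lme4's own count of covariance parameters
isSingular(mod5)              # TRUE if the estimate sits on the boundary
VarCorr(mod5)                 # look for variances near 0 or correlations near +/-1
```

In the `VarCorr()` output, variances estimated at (or very near) zero or correlations at ±1 are the usual signs that the data cannot support the full covariance matrix.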

Moreover, it seems that you have a bounded outcome variable in the $(0, 1)$ interval. The linear mixed model that you fit with lmer() has normal error terms and will not respect these bounds. You could instead consider fitting a Beta mixed effects model. This is, for example, available in the GLMMadaptive package I’ve written; you can find a worked analysis here.
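As a rough illustration of what such a Beta mixed model could look like, here is a minimal sketch using `mixed_model()` with the `beta.fam()` family from GLMMadaptive. It uses simulated data standing in for the real dataset (the data frame `sim`, the level names, and the effect sizes are all assumptions), keeps only a random intercept, and assumes every accuracy value lies strictly inside (0, 1); values of exactly 0 or 1 would need to be handled before fitting.

```r
library(GLMMadaptive)

set.seed(2)  # simulated data only; NOT the asker's data

# Mimic the design: 90 subjects (3 groups of 30), 2 sessions x 4 trial types.
# Rows are ordered by subject because subject varies slowest in expand.grid().
sim <- expand.grid(
  trialtype = factor(paste0("t", 1:4)),
  session   = factor(c("pre", "post"), levels = c("pre", "post")),
  subject   = factor(1:90)
)
sim$group <- factor(rep(c("g1", "g2", "g3"), each = 30))[as.integer(sim$subject)]

# Assumed data-generating process, giving accuracies strictly inside (0, 1)
b0  <- rnorm(90, sd = 0.5)
eta <- 1 + b0[as.integer(sim$subject)] +
  0.2 * (sim$session == "post") + rnorm(nrow(sim), sd = 0.5)
sim$accuracy <- plogis(eta)

# Beta mixed model with the same fixed effects as in the question
# and a random intercept per subject
mod_beta <- mixed_model(
  fixed  = accuracy ~ group + session + trialtype + session:trialtype,
  random = ~ 1 | subject,
  data   = sim,
  family = beta.fam()
)

summary(mod_beta)
```

The fixed-effect coefficients here are on the logit scale of the mean accuracy, so they are not directly comparable to the coefficients from lmer().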

Dimitris Rizopoulos