
I am new to statistical modelling and I have a potentially silly question. I've been working with a mixed model where the design matrix of one of the categorical random predictors (r_id) is sparse: i.e. typically each level of the predictor is only associated with a couple of data points.

My model equation is:

```r
model_set1 <- asreml(fixed = mean_score ~ 1 + sh_count + yob + sex,
                     random = ~ vm(an_id, ainv) + idv(r_id),
                     residual = ~ idv(units),
                     data = df)
```

There are 896 observations of `mean_score` and 664 levels of `r_id`. Some levels of `r_id` have multiple `mean_score`s (range 1 to 7, mean 1.36), and some `mean_score`s are associated with multiple `r_id`s (range 1 to 5, mean 1.32).

I was somewhat surprised when the variance estimate for this was quite large (32% of mean_score's variance). Is this likely to be because my model is overfit to the data? Intuitively I feel that this variable is not very informative because we don't have very much data on which to estimate the effect of a single level of the predictor, but perhaps my intuition is wrong.

I am interested to know whether those experienced in running linear mixed models would even choose to include such a predictor in their model. I am more interested in getting an intuition for this issue (pointers to textbooks/other resources welcome!) that will extend beyond this one example.

(Perhaps 'sparse' is not the best way to describe this, which may be why I struggled to find an answer elsewhere; if so, please correct me.)

  • Welcome to the site. It's a bit hard to answer without more info. What is the overall size of the dataset? How many levels does the variable in question have in total? Is this variable a grouping factor that you are fitting random intercepts for? What other variables are in the model? What is the model formula? Please include the model output. If working in R, please also post the output of `str(mydata)`. It sounds like you have small cluster sizes, but that is not necessarily a problem, nor should it have anything to do with the variance estimate (why do you think it is quite large?). – Robert Long Jul 14 '20 at 13:20
  • @RobertLong Thank you for your feedback. I have updated my question a bit, but your description of my problem in terms of cluster size was actually probably help enough - when I look this up using that terminology I get the answers I'm looking for. For example, I think my question is answered here: https://stats.stackexchange.com/questions/388937/minimum-sample-size-per-cluster-in-a-random-effect-model – silver arrow Jul 14 '20 at 15:58

1 Answer


It is not possible to say whether `r_id` accounting for 32% of the variation in your outcome is too high. In general, unless you simulated the data, or you know a priori for some other reason (for example, from previous studies or other domain knowledge) what the expected variance is, you can't really know what to expect.

Certainly it seems that you have a lot of singleton clusters, but the accepted answer to the question you linked in the comments says that the minimum cluster size is 1, with a few caveats. You could adapt the code there to gain insight into your particular situation.
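For example, a minimal simulation sketch along those lines (hypothetical numbers, and using `lme4` here rather than asreml, since it is freely available) would generate data with many small clusters and check whether the random-effect variance is recovered:

```r
## Hypothetical simulation: many small clusters, known variance components.
library(lme4)

set.seed(42)
n_clusters <- 664
sizes <- sample(1:3, n_clusters, replace = TRUE)   # mostly tiny clusters
id <- factor(rep(seq_len(n_clusters), times = sizes))

true_var <- 0.5                                    # true random-effect variance
u <- rnorm(n_clusters, sd = sqrt(true_var))        # cluster effects
y <- 2 + u[as.integer(id)] + rnorm(length(id))     # residual variance = 1

fit <- lmer(y ~ 1 + (1 | id))
VarCorr(fit)   # compare the estimated id variance to true_var
```

Repeating this many times (and varying the cluster-size distribution) shows whether, with your structure, the variance component is estimated without systematic bias, even though any single level is poorly estimated.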

One other note:

I was somewhat surprised when the variance estimate for this was quite large (32% of mean_score's variance)

A better comparison is to compare the variance to the residual variance (plus any other variance components), not to the variance of the outcome.
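That is, something like the following back-of-the-envelope calculation, with made-up component values standing in for your actual estimates (in asreml-R these would come from the fitted model's variance-component summary):

```r
## Hypothetical variance components -- replace with your own estimates.
v_r_id  <- 0.8   # idv(r_id)
v_an_id <- 0.7   # vm(an_id, ainv)
v_resid <- 1.0   # idv(units)

## Proportion attributable to r_id, relative to the sum of all modelled
## variance components (not the raw variance of mean_score):
prop_r_id <- v_r_id / (v_r_id + v_an_id + v_resid)
prop_r_id   # 0.32 with these made-up numbers
```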

Robert Long
  • Thank you so much! Yes, I think I worded that wrong, I calculated 32% as the variance explained by `r_id` over the sum of the residual variance and the variance explained by both random effects (`r_id` and `an_id`). I will definitely have a go at working through the example in the linked question. – silver arrow Jul 15 '20 at 09:44