I am new to statistical modelling and I have a potentially silly question. I've been working with a mixed model where the design matrix of one of the categorical random predictors (r_id) is sparse, i.e. typically each level of the predictor is only associated with a couple of data points.
My model equation is:
model_set1 <- asreml(fixed = mean_score ~ 1 + sh_count + yob + sex, random = ~ vm(an_id, ainv) + idv(r_id), residual = ~ idv(units), data = df)
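Written out in matrix form, I believe the call above corresponds to the following. This is a sketch of my understanding rather than anything taken from the asreml output: the symbols are my own notation, and I am assuming that ainv is the inverse of a relationship matrix A among the an_id levels.

```latex
\begin{aligned}
\mathbf{y} &= \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}_a \mathbf{a} + \mathbf{Z}_r \mathbf{r} + \mathbf{e}, \\
\mathbf{a} &\sim N(\mathbf{0},\, \sigma_a^2 \mathbf{A}), \qquad
\mathbf{r} \sim N(\mathbf{0},\, \sigma_r^2 \mathbf{I}), \qquad
\mathbf{e} \sim N(\mathbf{0},\, \sigma_e^2 \mathbf{I}).
\end{aligned}
```

Here y holds the mean_score values, X holds the intercept, sh_count, yob and sex, Z_a and Z_r are the design matrices for an_id and r_id, and e is the residual (units) term. Z_r is the sparse design matrix I am asking about.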
The number of observations in mean_score is 896. The number of levels in r_id is 664. Some r_id levels have multiple mean_score values (range 1 to 7, mean 1.36) and some mean_score values have multiple r_id levels (range 1 to 5, mean 1.32).
I was somewhat surprised that the variance estimate for r_id was quite large (32% of mean_score's variance). Is this likely to be because my model is overfit to the data? Intuitively I feel that this variable is not very informative, because we have very little data with which to estimate the effect of any single level of the predictor, but perhaps my intuition is wrong.
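One way to read the 32% figure, assuming it is the r_id component expressed as a share of the summed variance components (the symbols are my own notation, and the figure could instead be relative to the raw variance of mean_score):

```latex
\frac{\hat{\sigma}_r^2}{\hat{\sigma}_a^2 + \hat{\sigma}_r^2 + \hat{\sigma}_e^2} \approx 0.32
```

where the three terms in the denominator are the estimated variance components for an_id, r_id and the residual, respectively.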
I am interested to know whether those experienced in running linear mixed models would even choose to include such a predictor in their model. I am more interested in getting an intuition for this issue that will extend beyond this one example (pointers to textbooks or other resources are welcome!).
(Perhaps 'sparse' is not the best way to describe this, which may be why I struggled to find an answer elsewhere; if so, please correct me.)