
I've read several threads about overfitting but still couldn't find an answer that seems applicable to my case...

I use statsmodels' mixedlm to study the effects of 4 predictors on participants' positive emotion ratings. To be specific, the 4 predictors are doubt_rating (a potential moderator), time (repeated measures; two timepoints), event (categorical variable; 2 levels; between-subjects), and style (categorical variable; 2 levels; between-subjects). allCondition_byTime is the name of the dataset.

Here is the model I use to detect main effects of the predictors and interactions between them:

    import statsmodels.formula.api as smf

    mod = smf.mixedlm("positive_rating ~ doubt_rating + time:doubt_rating + event:doubt_rating + style:doubt_rating"
                      " + time:event:doubt_rating + time:style:doubt_rating + event:style:doubt_rating"
                      " + time:event:style:doubt_rating",
                      allCondition_byTime, groups=allCondition_byTime["id"]).fit()

There are 4 predictors in the model, so I'm worried that it may be subject to overfitting. I know that k-fold cross-validation may be helpful, but I just don't know how to apply it to my dataset.

Any comments and advice are highly appreciated!


Additional information:

There are 160 participants in total. Because I used time as one of the factors, each ID has 2 rows, and the two rows hold the same data except for the positive emotion rating (it was measured pre- and post-test). Below is a screenshot of what each row looks like (ignore share_freq, negative_rating, general_rating):

[screenshot: example rows of allCondition_byTime]

And this is the output of my mixedlm:

[screenshot: mixedlm summary output]

Gloria Ma
  • An interesting question. Can you explain a bit more how many IDs there are and what the data structure looks like for a single ID? How many parameters does your model currently estimate, and how many rows do you have? – Michael M Jan 24 '22 at 18:26
  • @MichaelM I have just added some more information above. Hope it makes my question clearer now. – Gloria Ma Jan 24 '22 at 19:43
  • I'm not fluent in Python, but it seems that your model doesn't provide individual coefficients for `time`, `event`, or `style`. That can make it hard to interpret the other coefficients. Was that a deliberate choice? – EdM Jan 24 '22 at 21:20
  • @EdM actually the main analyses are about the effects of time, event and style on positive ratings. Based on that model, I only found a significant effect of time on positive ratings. Therefore, when I took the analyses further by adding another variable to the model (doubt_rating, which I want to test for moderating effects), I only focused on its possible interactions with the original 3 predictors. Would it solve my overfitting problem if I just included time (the only significant effect observed in the primary model) and doubt_rating in the model? – Gloria Ma Jan 25 '22 at 00:59

1 Answer

The results you display might indicate no significant result at all. With interactions up to fourth order, your "4 predictors" end up being more than that. Your model shows 8 regression coefficients beyond the intercept (from the perspective of overfitting, that's already 8 effective predictors). Although one coefficient beyond the intercept is nominally "significant," the overall distribution of p-values among coefficients doesn't seem that far from the uniform distribution you would get if there were no real non-zero coefficients.
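
A rough way to eyeball that claim, assuming `mod` is the fitted MixedLM from the question: pull out the fixed-effect p-values and line them up against uniform quantiles.

    import numpy as np

    # Fixed-effect p-values from the fitted model (drop the intercept and the
    # random-effect variance term, whose label is assumed to be "Group Var").
    pvals = mod.pvalues.drop(["Intercept", "Group Var"], errors="ignore").sort_values()

    # Expected quantiles of a uniform distribution for the same number of tests
    expected = (np.arange(1, len(pvals) + 1) - 0.5) / len(pvals)
    print(pvals.to_frame("observed_p").assign(expected_uniform=expected))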

Furthermore, your coding has suppressed the individual (main-effect) coefficients for several predictors, which is typically not a good thing to do. A model with all individual coefficients and interactions would have many more terms. There's a big risk in just plugging all the high-level interactions into a small data set: because the "significance" assessments have to account for the number of coefficients estimated, every extra term makes it harder to find the important true relationships. Plus, it's really hard to wrap your mind around a 3-way interaction, let alone a 4-way interaction.
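
To see how many coefficients a fully specified model would involve, you can build the fixed-effects design matrix directly. A sketch, assuming `allCondition_byTime` from the question and that `time`, `event`, and `style` each have two levels:

    from patsy import dmatrix

    # Full factorial expansion: intercept, 4 main effects, 6 two-way,
    # 4 three-way and 1 four-way interaction = 16 columns in total.
    X = dmatrix("doubt_rating * time * event * style", allCondition_byTime)
    print(X.design_info.column_names)
    print("number of fixed-effect columns:", X.shape[1])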

You have to decide what's most important to glean from your data. For example, it seems that your main interest is in the emotion-rating difference between the 2 time points, given the other predictors. In that case, you could simplify your model by just using the emotion-rating difference for each participant as the outcome, similar to paired t-tests. Then, with only 1 effective outcome per participant, you wouldn't need a mixed model.
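
A minimal sketch of that reshaping, assuming the `time` column has levels named "pre" and "post" (adjust to your actual coding) and that the between-subject columns repeat within each `id`:

    # One row per participant: positive_rating difference between the two time points
    wide = allCondition_byTime.pivot_table(
        index=["id", "event", "style", "doubt_rating"],
        columns="time",
        values="positive_rating",
    ).reset_index()
    wide["positive_diff"] = wide["post"] - wide["pre"]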

You also have to decide how many interactions it makes sense to evaluate. If you use paired differences to remove time as a predictor, you still have a binary event, a binary style, and a continuous doubt_rating predictor. If you treat the continuous predictor linearly, that gives you 3 individual coefficients, 3 two-way interactions, and a 3-way interaction, for a total of 7 predictors. A rule of thumb for studies like this is that you can handle about 1 coefficient estimate per 10-20 observations without overfitting. Although using paired differences would cut your number of effective observations from 320 to 160, that still leaves roughly 160/7 ≈ 23 observations per coefficient, comfortably within that range.
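
Continuing the hypothetical `wide` data frame above, that simplified model is an ordinary regression with 7 coefficients beyond the intercept:

    import statsmodels.formula.api as smf

    # event * style * doubt_rating expands to 3 main effects,
    # 3 two-way interactions and 1 three-way interaction.
    diff_mod = smf.ols("positive_diff ~ event * style * doubt_rating", data=wide).fit()
    print(diff_mod.summary())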

To answer the specific question about how to test for overfitting: a good way is to use bootstrap validation of your model building process. The "optimism bootstrap" repeats the modeling on multiple bootstrapped samples of your data, evaluates each model on the full data set, and calculates how much better each model fits its corresponding bootstrap sample than it does the full data set (the "optimism" of each model). The averaged optimism among those models estimates the optimism in applying the full model to the underlying population from which you took the original data sample. An overfit model will be highly "optimistic" in that sense. See Chapter 7 of Elements of Statistical Learning, especially Section 7.4.
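
Here is a rough sketch of that procedure for the difference model above, using R² as the fit measure (names follow the hypothetical `wide` data frame from the earlier sketches; treat this as an illustration of the idea, not a polished validation routine):

    import numpy as np
    import statsmodels.formula.api as smf

    formula = "positive_diff ~ event * style * doubt_rating"

    def r2(fit, data):
        """R-squared of `fit`'s predictions evaluated on `data`."""
        resid = data["positive_diff"] - fit.predict(data)
        total = data["positive_diff"] - data["positive_diff"].mean()
        return 1 - np.sum(resid**2) / np.sum(total**2)

    optimism = []
    for i in range(500):
        boot = wide.sample(frac=1, replace=True, random_state=i)  # bootstrap resample
        boot_fit = smf.ols(formula, data=boot).fit()
        # optimism = fit on the bootstrap sample minus fit on the full data
        optimism.append(r2(boot_fit, boot) - r2(boot_fit, wide))

    apparent = r2(smf.ols(formula, data=wide).fit(), wide)
    print("apparent R^2:", apparent)
    print("optimism-corrected R^2:", apparent - np.mean(optimism))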

A final warning: do not rely solely on model summaries like the one you show when there are interactions, particularly interactions with continuous predictors whose values are far from 0. With the default treatment (dummy) coding, the coefficient estimate (and the "significance" of its difference from 0) for a predictor involved in an interaction (or for an interaction term that has higher-level interactions) applies to the situation in which all the interacting predictors are at their reference levels (categorical interactors) or at 0 (continuous interactors). That might not represent any real scenario consistent with your data. Instead, examine and compare predictions based on realistic combinations of predictor-variable values.
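
One way to do that with the simplified model sketched above: predict over a grid of realistic predictor combinations and compare the predictions directly (again assuming the hypothetical `wide` and `diff_mod` objects):

    import itertools
    import pandas as pd

    # Grid of observed event/style levels crossed with low, middle and high
    # doubt_rating values (quartiles of the observed distribution).
    grid = pd.DataFrame(
        list(itertools.product(
            wide["event"].unique(),
            wide["style"].unique(),
            wide["doubt_rating"].quantile([0.25, 0.5, 0.75]),
        )),
        columns=["event", "style", "doubt_rating"],
    )
    grid["predicted_diff"] = diff_mod.predict(grid)
    print(grid)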

EdM
  • Thank you so much! I wonder if a simple model comparison can do the trick by comparing `model1 = sm.OLS.from_formula('positive_rating ~ doubt_rating * time * event * style', data=allCondition_byTime).fit()` to `model2` with the same structure but without doubt_rating, then running `anova_results = anova_lm(model1, model2)` to see the difference. – Gloria Ma Jan 26 '22 at 16:19
  • @GloriaMa if your models are nested and properly specified, then a likelihood-ratio test is a good way to see if adding a particular predictor helps. You might have to be careful with mixed models; see [this page](https://stats.stackexchange.com/q/143763/28500) and its links under "Related." If you include all possible interactions of `doubt_rating` with other predictors as you propose, the large number of new coefficients might lead to a risk of missing something important. Think carefully if there are some specific interactions that are most important to evaluate. – EdM Jan 26 '22 at 17:07