Mixed models and backward elimination

Question

Let's say I have a data like this, and I'm trying to build a mixed model.

studentId   | courseId | courseName | year | courseGroup | timespent | count | mark
stud1       | 19       | M101       | 2008 | F           | 12.3      | 23    | 3.7
stud1       | 21       | E102       | 2008 | C           | 2.3       | 15    | 4
stud1       | 109      | H300       | 2008 | E           | 22.3      | 5     | 3
stud2       | 19       | M101       | 2008 | F           | 3.3       | 45    | 3
stud2       | 21       | E102       | 2008 | C           | 12.3      | 56    | 3.3
stud3       | 200      | M101       | 2009 | F           | 12.3      | 21    | 3.7

the full model would be:

lmer.model.full <- mark ~ courseGroup + timespent + count + courseGroup:(timespent+count) + (1|studentId) + (1|courseName/courseId)

Running the model results with the following summary:

Fixed effects:
                         Estimate Std. Error         df t value Pr(>|t|)    
(Intercept)             3.909e+00  1.282e-01  8.089e+02  30.491  < 2e-16 ***
courseGroupE           -2.404e-01  1.835e-01  6.417e+02  -1.311  0.19049    
courseGroupF           -1.105e-01  1.493e-01  2.246e+02  -0.740  0.46008    
timespent              -1.552e-02  5.065e-02  1.872e+03  -0.306  0.75932    
count                  -5.244e-02  5.409e-02  1.869e+03  -0.969  0.33244    
courseGroupE:timespent  8.740e-02  5.184e-02  1.823e+03   1.686  0.09196 .  
courseGroupF:timespent  2.350e-03  3.992e-02  1.881e+03   0.059  0.95308    
courseGroupE:count     -6.546e-02  2.673e-02  1.158e+03  -2.449  0.01446 *  
courseGroupF:count     -7.015e-02  2.470e-02  1.373e+03  -2.840  0.00457 **

In order to proceed with the model selection using backward elimination, I should continue by removing the one with the highest p-value, which is timespent. But, the interaction between timespent and courseGroup is marginally significant, might become significant in later iterations. On the other hand, p-value for interaction between F group and timespent is 0.95. What should I do in this case and how to proceed further?

One more thing that puzzle me... what should I be doing in this case:

 Fixed effects:
                             Estimate Std. Error         df t value Pr(>|t|)    
    (Intercept)             3.909e+00  1.282e-01  8.089e+02  30.491  < 2e-16 ***
    courseGroupE           -2.404e-01  1.835e-01  6.417e+02  -1.311  0.16049    
    courseGroupF           -1.105e-01  1.493e-01  2.246e+02  -0.740  0.96008    
    timespent              -1.552e-02  5.065e-02  1.872e+03  -0.306  0.25932    
    count                  -5.244e-02  5.409e-02  1.869e+03  -0.969  0.33244    
    courseGroupE:timespent  8.740e-02  5.184e-02  1.823e+03   1.686  0.09196 .  
    courseGroupF:timespent  2.350e-03  3.992e-02  1.881e+03   0.059  0.25308    
    courseGroupE:count     -6.546e-02  2.673e-02  1.158e+03  -2.449  0.01446 *  
    courseGroupF:count     -7.015e-02  2.470e-02  1.373e+03  -2.840  0.00457 **

Should I remove courseGroup from the model, leaving only with timespent and count, or remove count? What would be the best indicator that courseGroup should be excluded from the model?

To state the obvious: Model selection based on $p$-values is a bad idea and it should be avoided. The fact you calculate $p$-values on a mixed-effects model make this choice even more questionable. How did you get these $p$-values? I think that standard `lmer` output doesn't offer this information. — usεr11852, Jan 04 '15 at 00:14
With lmer from lmerTest package... what else would you suggest? I was thinking to try with step() function from the same package, since it was giving an interesting results previously. I understand the whole point of p-values and mixed models, but what would be the other approach? — Srecko, Jan 04 '15 at 01:19
Not sure why haven't tried before, but seems like step() function really gives "reasonable" model. All I did is initially specified the full model, that included all fixed and random effects, and called the function passing it the full model. If you have any better ideas, please let me know... — Srecko, Jan 04 '15 at 01:58
Thank you for this additional information. Please see my answer below. — usεr11852, Jan 04 '15 at 12:07
Stepwise selection is a *bad* idea, especially using p-values that are unreliable for mixed models. See [here](http://stats.stackexchange.com/a/6911/35989), [here](http://www.stata.com/support/faqs/statistics/stepwise-regression-problems/) and `?lme4::pvalues`. — Tim, Jan 04 '15 at 12:12
thanks Tim, I'll try something from the answer below... guess I got used from the other approaches, so although I know it's not good, p-value is always the first thing to check :) — Srecko, Jan 04 '15 at 17:28

usεr11852 · Accepted Answer · 2015-01-04T12:17:34.103

The $p$-values you are getting are based on Satterthwaite's approximation of degrees of freedom. This approximation is conducted because the degrees of freedom (DFs) in the context of a mixed-effects model are not obvious. Other people might advocate different approximations (eg. Kenward-Roger approximation); realistically speaking seeing the DFs you are looking at (200+ most of the time), the difference you will get from one approximation to another they will be quite insignificant. Your $p$-values will essentially be $z$-values as DFs >= 500. To state the obvious though: $p$-values are not the probability of making a mistake by rejecting a true null hypothesis. Therefore a small value does not mandate that the associated variable should be included.

Having said the above please see the following thread Algorithms for automatic model selection. You are exactly in the same situation. If you really want to go down the road of a stepwise (linear mixed-effects) regression please use AIC; this option will try to somewhat penalize your model's complexity. Some standard caveats about the use of AIC on mixed-effects model can be found in the FAQ of http://glmm.wikidot.com. You should not use AIC's "vanilla" version directly. Please look at Greven and Kneib, 2010 regrading this; they present a corrected cAIC. They also provide an R package implementing the corrected cAIC they outline.

In general I would suggest you look into cross-validation techniques and/or look at boostrapping your model. Check the functionality around lmer's bootMer, inspect the confidence intervals of your parameters; it should give you a better idea of what is worth including in your final model. I would argue that even jack-knifing is better than $p$-value selection.

Finally while I "get" the idea of variable selection in terms of fixed-effects, in terms of random-effects the whole idea appears to me as a simple data-dredging technique. The random effects are suppose to be based on the research question. Otherwise one simply cherry-picks an error structure trying to "squeeze more significance out of the remaining terms" (glmm.wikidot's FAQ once more).

Thanks! A great answer! Seems like I have some reading to do... :) — Srecko, Jan 04 '15 at 17:25
Variable selection based on $P$-values, AIC, BIC, Cp, etc. will bias every aspect of the model, especially the residual variance. — Frank Harrell, Jan 04 '15 at 18:47
Thanks Frank, but what would you suggest? In the example above, I pretty much simplified the data... there are actually 4 counts and 4 times (everything is based on theory) now I want to predict the _mark_. How can I choose the best model? I know theoretically what might work, but I want to show what else might be a good predictor as well... — Srecko, Jan 04 '15 at 20:02

Mixed models and backward elimination

1 Answers1