
I am trying to run a binomial GLMM using the glmer command in R.

It is not clear to me whether I need to first select the best model or if I should just report the p-values that I get with the summary(my.model) command.

If I need to select the best model first, what is the best way in R to compare two binomial GLMMs built with glmer? Is anova(my.model1, my.model2, test="Chisq") a valid way of comparing my models?
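For concreteness, here is the kind of comparison I have in mind (a sketch with simulated data; the data frame, variables, and model names are all made up):

```r
# Hypothetical example: two nested binomial GLMMs compared with anova()
library(lme4)

set.seed(1)
n <- 400
d <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  group = factor(rep(1:20, each = 20))
)
# Binomial response with a random intercept per group
d$y <- rbinom(n, 1, plogis(0.5 * d$x1 + rnorm(20)[d$group]))

my.model1 <- glmer(y ~ x1 + (1 | group), data = d, family = binomial)
my.model2 <- glmer(y ~ x1 + x2 + (1 | group), data = d, family = binomial)

# Likelihood-ratio test between the two nested models
anova(my.model1, my.model2)
```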

Thanks for your help, Jade

EDIT: Thanks. So if I understand correctly, I have to stick to the model that I defined when I planned my experiment and not drop any variables, even if they do not significantly improve the model (by comparing AIC, for example)?

  • Please register &/or merge your accounts (you can find information on how to do this in the **My Account** section of our [help]), then you will be able to edit & comment on your own question. – gung - Reinstate Monica Jun 05 '17 at 00:48

1 Answer


p-values are predicated on your having chosen the model before you observed any data. If you choose the model (or any transformations, etc.) based on the data, and then calculate p-values using the same data you used to select the model, your p-values will be biased low, i.e. towards statistical significance. This can be considered a case of "p-hacking"; see this question: How much do we know about p-hacking “in the wild”?

Choose your model before you collect the data that you use to calculate p-values.

It's not hard - and very enlightening - to simulate model selection and see just how many spurious "significant" p-values it can produce, even in the total absence of any true effect.
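For example, here is a minimal sketch of such a simulation (using base-R glm rather than glmer to keep it fast and self-contained; the idea carries over):

```r
# Under a pure null (no predictor has any effect), trying several
# candidate models and keeping the "best" one inflates the rate of
# nominally significant results far above 5%.
set.seed(42)
n_sims <- 500
n      <- 100
k      <- 10   # candidate predictors, all pure noise

best_p <- replicate(n_sims, {
  y <- rbinom(n, 1, 0.5)            # outcome unrelated to anything
  X <- matrix(rnorm(n * k), n, k)
  # "Model selection": fit one-predictor models, keep the smallest p-value
  pvals <- apply(X, 2, function(x)
    summary(glm(y ~ x, family = binomial))$coefficients[2, 4])
  min(pvals)
})

mean(best_p < 0.05)   # roughly 1 - 0.95^10, i.e. around 40%, not 5%
```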

(Alternatively: collect some data, then play around with them to determine a well-fitting model, then independently collect more data and calculate p-values with your chosen model and the new data. This decouples model choice and data collection.)
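A minimal sketch of that splitting idea (with a made-up data frame standing in for the two rounds of data collection):

```r
# Hypothetical data frame standing in for data collected in two stages
set.seed(7)
dat <- data.frame(y = rbinom(120, 1, 0.5), x = rnorm(120))

# Split into an exploration half and a confirmation half
idx     <- sample(nrow(dat), nrow(dat) %/% 2)
explore <- dat[idx, ]
confirm <- dat[-idx, ]

# ... play around with `explore` to settle on a model, then fit that
# single prespecified model once on `confirm` and report these p-values:
final <- glm(y ~ x, family = binomial, data = confirm)
summary(final)$coefficients
```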

Stephan Kolassa