Choosing the reference group in a regression with a non-ordinal predictor

Question

When running a model with a categorical predictor, the output contains n-1 p-values, where n is the number of levels in the factor. The remaining level is taken as the reference group.

For example: I observed plants in different environments and want to know whether substrate1, substrate2 and substrate3 (categorical, non-ordinal predictors) influence the height, number of leaves and weight (response variables) of the plants.

My question is: which group should be the reference? Note: none of the groups can be considered a control group, or a "normal" group from which the others differ. For example, I will run these three models:

Height ~ substrate1 + ...
Weight ~ substrate1 + ...
Number_of_leaves ~ substrate1 + ...

substrate1 has 5 groups: 'a', 'b', 'c', 'd', 'e'. It is non-ordinal!

Which group of substrate1 would you suggest I choose as the reference group for each model?

- One of the two extreme groups (even if it is not the same group in each model)
- The group closest to the overall mean (even if it is not the same group in each model)
- The group that contains the greatest number of observations
- Any group, but it is better to always choose the same one

Here are some graphs to make more sense of my question. Should I always choose group 'c', or should I choose 'c' for the first model, 'd' for the second model and 'e' for the last model? Or something else?

score 1 · Accepted Answer · answered Dec 06 '13 at 17:14

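Regardless of which level you set as the reference, the resulting model fit will be equivalent. The fitted values, residuals and overall test are the same; only the individual coefficients change, because each one is a comparison against whichever level is the reference.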
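To illustrate the point about the reference level, here is a minimal sketch in R. The data are simulated, and the names `plants`, `substrate_c`, `fit_a` and `fit_c` are made up for the illustration; it also assumes a model with substrate1 as the only predictor, which is simpler than the models in the question.

```r
# Simulated data (hypothetical): 5 substrate groups, 10 plants each.
set.seed(1)
plants <- data.frame(
  substrate1 = factor(rep(c("a", "b", "c", "d", "e"), each = 10))
)
plants$Height <- rnorm(50, mean = rep(c(10, 12, 15, 11, 14), each = 10), sd = 2)

# By default the first level ("a") is the reference.
fit_a <- lm(Height ~ substrate1, data = plants)

# Refit with "c" as the reference level.
plants$substrate_c <- relevel(plants$substrate1, ref = "c")
fit_c <- lm(Height ~ substrate_c, data = plants)

# The coefficients differ, because each one is a contrast against the reference...
coef(fit_a)
coef(fit_c)

# ...but the fitted values and R-squared agree (up to numerical tolerance).
all.equal(unname(fitted(fit_a)), unname(fitted(fit_c)))
all.equal(summary(fit_a)$r.squared, summary(fit_c)$r.squared)
```

Both `all.equal()` calls should return TRUE, so the choice of reference only affects which pairwise comparisons show up directly in the coefficient table.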

You are probably interested in whether there are mean differences between the substrate levels. After fitting the model, you should run a Tukey post-hoc comparison to see which levels differ (the TukeyHSD function in R). Again, it does not matter which level was the reference.
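As a rough sketch of that post-hoc step, the following uses simulated data again (the `plants` data frame and its values are hypothetical) and assumes a one-way model with substrate1 as the only term. `TukeyHSD()` works on a fit from `aov()`, so the model is fitted that way here.

```r
# Simulated data (hypothetical): 5 substrate groups, 10 plants each.
set.seed(1)
plants <- data.frame(
  substrate1 = factor(rep(c("a", "b", "c", "d", "e"), each = 10))
)
plants$Height <- rnorm(50, mean = rep(c(10, 12, 15, 11, 14), each = 10), sd = 2)

# TukeyHSD() expects a model fitted with aov().
fit <- aov(Height ~ substrate1, data = plants)
tukey <- TukeyHSD(fit, which = "substrate1")

# One row per pair of levels: difference in means, confidence bounds
# and a p-value adjusted for multiple comparisons.
tukey$substrate1
```

With 5 levels this gives all 10 pairwise comparisons, so every group is compared with every other group and none is left without a p-value.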

Also, I'm not sure what you are showing in the bar plots (are those the mean estimates?). It would make more sense to produce boxplots for each level.

Glen


Ah OK, so it does not change anything. The barplots display the means of each group, as you said! I was planning to add significance stars at the top of each bar (or box, since a boxplot would be a better idea). Would that make sense, given that one group has no associated p-value? Thank you for your help @Glen – Remi.b Dec 06 '13 at 17:33
I think I found the answer to the question in my comment here: http://stats.stackexchange.com/questions/20125/highlighting-significant-results-from-non-parametric-multiple-comparisons-on-box. I should label the boxplots with letters so that statistically similar groups share a letter! Thank you @Glen – Remi.b Dec 06 '13 at 17:45
