Based on the answer to "Significance of categorical predictor in logistic regression", I tried adding "-1" to my model formula to fit it without an intercept, so that I could see the coefficient for every level directly.
It looks like adding "-1" only helps for the first variable in the formula, and doesn't help when there is more than one categorical variable. I tried fitting "overweight ~ race + diet - 1" and then reversing the order of race and diet.
If race is first in the formula, all 4 races show up as significant.
glm(formula = overweight ~ race + diet - 1, family = "binomial",
    data = data)
Coefficients:
       Estimate Std. Error z value Pr(>|z|)
race1  -1.17569    0.07916 -14.851  < 2e-16 ***
race2  -1.77863    0.08446 -21.058  < 2e-16 ***
race3  -1.85692    0.06967 -26.651  < 2e-16 ***
race4  -1.21037    0.07175 -16.869  < 2e-16 ***
diet2  -1.15341    0.09676 -11.921  < 2e-16 ***
diet3 -14.21256  315.57607  -0.045 0.964078
diet4  -1.36219    0.08796 -15.486  < 2e-16 ***
diet5  -2.03216    0.58765  -3.458 0.000544 ***
diet6 -14.09964  186.44637  -0.076 0.939719
When diet is first, race1 is not included in the model, and race4's z value is not significant.
glm(formula = overweight ~ diet + race - 1, family = "binomial",
    data = data)
Coefficients:
       Estimate Std. Error z value Pr(>|z|)
diet1  -1.17569    0.07916 -14.851  < 2e-16 ***
diet2  -2.32910    0.10598 -21.978  < 2e-16 ***
diet3 -15.38825  315.57607  -0.049    0.961
diet4  -2.53788    0.09839 -25.794  < 2e-16 ***
diet5  -3.20785    0.59015  -5.436 5.46e-08 ***
diet6 -15.27533  186.44638  -0.082    0.935
race2  -0.60294    0.10888  -5.538 3.06e-08 ***
race3  -0.68123    0.09790  -6.959 3.44e-12 ***
race4  -0.03468    0.09804  -0.354    0.724
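To make the comparison reproducible without my data, here is a minimal sketch on simulated data. The data frame `sim`, the seed, and the effect sizes are all made up for illustration. It fits both orderings and checks whether they are the same model written in two parameterizations: same fitted probabilities, with the first factor getting one coefficient per level and the second factor expressed relative to its first level.

```r
# Simulated stand-in for the real data (all values here are hypothetical)
set.seed(1)
n <- 2000
sim <- data.frame(
  race = factor(sample(1:4, n, replace = TRUE)),
  diet = factor(sample(1:3, n, replace = TRUE))
)

# Assumed log-odds: one baseline per race plus a shift per diet
eta <- c(-1.2, -1.8, -1.9, -1.2)[as.integer(sim$race)] +
  c(0, -1.0, -0.5)[as.integer(sim$diet)]
sim$overweight <- rbinom(n, 1, plogis(eta))

# Same two formulas as in the question, differing only in term order
fit_race_first <- glm(overweight ~ race + diet - 1, family = binomial, data = sim)
fit_diet_first <- glm(overweight ~ diet + race - 1, family = binomial, data = sim)

# Which levels get their own coefficient depends on which factor comes first
print(names(coef(fit_race_first)))
print(names(coef(fit_diet_first)))

# The fitted probabilities should agree up to numerical tolerance
print(all.equal(fitted(fit_race_first), fitted(fit_diet_first), tolerance = 1e-6))

# With diet first, "race4" is the difference race4 - race1 from the other fit
print(unname(coef(fit_diet_first)["race4"]))
print(unname(coef(fit_race_first)["race4"] - coef(fit_race_first)["race1"]))
```

The same arithmetic holds in my real output: race4 with diet first (-0.03468) equals race4 minus race1 with race first (-1.21037 - (-1.17569)).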
I also tried subtracting 1 from each of the categorical variables, but that didn't add diet1 to the model:
glm(formula = overweight ~ race -1 + diet - 1, family = "binomial",
    data = data)
Coefficients:
       Estimate Std. Error z value Pr(>|z|)
race1  -1.17330    0.07915 -14.823  < 2e-16 ***
race2  -1.77969    0.08445 -21.073  < 2e-16 ***
race3  -1.85552    0.06968 -26.628  < 2e-16 ***
race4  -1.21214    0.07176 -16.892  < 2e-16 ***
diet2  -1.15544    0.09675 -11.943  < 2e-16 ***
diet3 -14.21292  315.57904  -0.045 0.964077
diet4  -1.36182    0.08796 -15.482  < 2e-16 ***
diet5  -2.01937    0.58772  -3.436 0.000591 ***
diet6 -14.09991  186.44215  -0.076 0.939717
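As a sanity check on the formula itself, a small sketch using `terms()` from base R suggests that the second "-1" changes nothing: both formulas appear to reduce to the same two terms with the intercept removed. No data are needed for this check; the object names `f_once` and `f_twice` are my own.

```r
# Two ways of writing the formula: "-1" once, and "-1" after each factor
f_once  <- overweight ~ race + diet - 1
f_twice <- overweight ~ race - 1 + diet - 1

# terms() parses a formula without needing the variables to exist
t_once  <- terms(f_once)
t_twice <- terms(f_twice)

# Compare the model terms and the intercept flag (0 means no intercept)
print(attr(t_once, "term.labels"))
print(attr(t_twice, "term.labels"))
print(attr(t_once, "intercept"))
print(attr(t_twice, "intercept"))
```

If that reading is right, "-1" only removes the single intercept and is not applied per variable.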
Is there a way to fit multiple categorical variables while keeping all the "categories" in the model? Is there a reason why this shouldn't be done?
In this case, I expect race4 to be statistically significant, but when race1 is used as the reference, race4 is not statistically significant. Is there a way to avoid this?
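To show what I mean by the reference level mattering, here is a minimal sketch on simulated data (again hypothetical: `sim`, `race_alt`, the seed and the effect sizes are invented, and race 4 is deliberately given the same log-odds as race 1). It uses `relevel()` to switch the reference from race 1 to race 2, and `drop1()` with a likelihood-ratio test as one possible way to test race as a whole; I am not sure this is the recommended approach.

```r
# Simulated stand-in for the real data (all values here are hypothetical)
set.seed(2)
n <- 2000
sim <- data.frame(
  race = factor(sample(1:4, n, replace = TRUE)),
  diet = factor(sample(1:3, n, replace = TRUE))
)
eta <- c(-1.2, -1.8, -1.9, -1.2)[as.integer(sim$race)] +
  c(0, -1.0, -0.5)[as.integer(sim$diet)]
sim$overweight <- rbinom(n, 1, plogis(eta))

# Default coding: race 1 is the reference, so "race4" means race 4 vs race 1
fit_ref1 <- glm(overweight ~ race + diet, family = binomial, data = sim)

# Same model with race 2 as the reference: "race_alt4" means race 4 vs race 2
sim$race_alt <- relevel(sim$race, ref = "2")
fit_ref2 <- glm(overweight ~ race_alt + diet, family = binomial, data = sim)

# The per-level z tests answer different questions in the two codings
print(coef(summary(fit_ref1))["race4", ])
print(coef(summary(fit_ref2))["race_alt4", ])

# A likelihood-ratio test for race as a whole, under each coding
lrt_ref1 <- drop1(fit_ref1, test = "Chisq")
lrt_ref2 <- drop1(fit_ref2, test = "Chisq")
print(lrt_ref1)
print(lrt_ref2)
```

In this sketch each per-level p-value is a comparison against whichever level is the reference, so it changes when the reference changes, while the fit itself stays the same.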