Variables significant when changing reference category?

Question

I'm researching the relationship between risk tolerance and demographic factors using logistic regression.
When I first included generation (Z, Y, X, BB) in my model with BB as the baseline, the coefficients for Z, Y and X were all insignificant. When I changed the reference category to Z, the other generations (Y, X and BB) were all significant at the 1% level.
Can anyone tell me why this happens and how I should deal with it?

Was the reference category BB when the results were insignificant? — Sathya Ih, Aug 06 '20 at 11:02
Yes, and when the reference category was Z, the results became significant for all of the generations. — Long Duong, Aug 06 '20 at 11:18
Some similar questions with answers: https://stats.stackexchange.com/questions/228672/why-does-changing-how-i-code-my-dummy-variable-change-significance/228703, https://stats.stackexchange.com/questions/157475/model-change-after-switching-reference-level-in-r-logistic-regression-model-with, https://stats.stackexchange.com/questions/60817/significance-of-categorical-predictor-in-logistic-regression — kjetil b halvorsen, Aug 06 '20 at 16:54

score 1 · Answer 1 · answered Aug 06 '20 at 13:18

This represents the (somewhat confusing) way that regression results involving multi-level categorical predictors are typically reported.

Your regression evidently used treatment coding for the 4-level categorical predictor generation. That chooses one level of the predictor as a reference. The reported regression coefficients (and their p-values) then are for differences of each of the other levels from that particular reference level. So it's not surprising that the individual significance indicators change as you change the reference level.*
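To make the role of the reference level concrete, here is a minimal Python sketch. It assumes a simplified model with generation as the only predictor (the model in the question has more terms), and the counts are invented purely for illustration. Under that assumption each treatment-coded coefficient is simply the log odds ratio against the reference level, so it can be computed by hand without a fitting routine.

```python
import math

# Hypothetical counts, invented for illustration:
# generation -> (number choosing high risk, number not choosing high risk)
counts = {"BB": (45, 55), "X": (50, 50), "Y": (55, 45), "Z": (20, 80)}

def wald_vs_reference(counts, ref):
    """Coefficient, SE and two-sided Wald p-value of each level versus `ref`.

    Valid only for a logit model whose single predictor is this categorical
    variable: the coefficient is the log odds ratio against the reference
    level and its SE is sqrt(sum of reciprocal cell counts).
    """
    a0, b0 = counts[ref]
    out = {}
    for level, (a, b) in counts.items():
        if level == ref:
            continue
        coef = math.log(a / b) - math.log(a0 / b0)
        se = math.sqrt(1 / a + 1 / b + 1 / a0 + 1 / b0)
        p = math.erfc(abs(coef / se) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
        out[level] = (coef, se, p)
    return out

vs_bb = wald_vs_reference(counts, "BB")  # BB as the reference level
vs_z = wald_vs_reference(counts, "Z")    # Z as the reference level

for ref, res in (("BB", vs_bb), ("Z", vs_z)):
    for level, (coef, se, p) in res.items():
        print(f"ref={ref}  {level}: coef={coef:+.3f}  se={se:.3f}  p={p:.4f}")
```

With these made-up counts, X and Y are not significantly different from BB but are strongly different from Z, although the underlying data and fitted probabilities are identical in both parameterizations. The Z versus BB contrast keeps the same p-value under either reference; only its sign flips.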

This way of reporting results for individual levels of the predictor doesn't represent the overall significance of generation, including all 4 of its levels. That's probably what you're most interested in. For that you need to compare a model that includes all the predictors against one in which you have removed generation completely, and see if the models are significantly different. That's typically done with an analysis of variance comparison between the two models. If you thus find generation to be significant overall then you can use standard post-hoc tests to compare among the individual levels.
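The model comparison described above can be sketched as a likelihood-ratio test. This is a simplified illustration, not the model from the question: it assumes generation is the only predictor, so the "full" model fits one proportion per generation and the reduced model fits a single overall proportion. The counts are hypothetical. With other predictors present, both models would keep those predictors and differ only in whether generation is included.

```python
import math

# Hypothetical counts, invented for illustration:
# generation -> (number choosing high risk, number not choosing high risk)
counts = {"BB": (45, 55), "X": (50, 50), "Y": (55, 45), "Z": (20, 80)}

def log_lik(groups):
    """Maximized Bernoulli log-likelihood when each group has its own proportion."""
    ll = 0.0
    for yes, no in groups:
        n = yes + no
        ll += yes * math.log(yes / n) + no * math.log(no / n)
    return ll

# Full model: a separate probability for each generation.
ll_full = log_lik(counts.values())

# Reduced model: generation removed, one common probability.
total_yes = sum(yes for yes, _ in counts.values())
total_no = sum(no for _, no in counts.values())
ll_reduced = log_lik([(total_yes, total_no)])

lr = 2 * (ll_full - ll_reduced)  # likelihood-ratio statistic
df = len(counts) - 1             # one parameter per non-reference level
assert df == 3                   # the closed form below holds only for df = 3

# Upper-tail probability of a chi-square distribution with 3 degrees of freedom.
p = math.erfc(math.sqrt(lr / 2)) + math.sqrt(2 * lr / math.pi) * math.exp(-lr / 2)

print(f"LR statistic = {lr:.2f}, df = {df}, p = {p:.2g}")
```

The statistic and its p-value do not depend on which generation is the reference level, which is why this test answers the question of whether generation matters overall. In Stata, which the `logit ... i.generation` syntax suggests is in use, `testparm i.generation` after fitting gives a Wald version of the same joint test.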

*I am a bit surprised that "BB" was significant when "Z" was the reference but "Z" wasn't when "BB" was the reference. Hard to say what's going on without more details about the rest of your model, in particular any interaction terms.

EdM


Thank you for your reply. My model is: ```logit highrisk age i.period i.generation control```, in which highrisk = 1 if the respondent is willing to take a high-risk investment and 0 otherwise; period is a 4-year period from 1993 to 2019 (93-96, 97-00, ..., 13-16, 17-19); the remaining terms are generation and the control variable. – Long Duong Aug 06 '20 at 13:32
@LongDuong your model has a couple of more serious potential problems. First, if "BB" is my baby-boomer generation and so forth with gen X, Y and Z, then the `age` variable is necessarily and directly related to `generation`, posing a substantial collinearity problem. Also, some younger generations Y and Z were unlikely to have been investing in the earlier periods; most gen Z weren't even born in the 93-96 period. I'd deal with those collinearity and modeling problems before I got too concerned about the differences in reporting significance for this model. – EdM Aug 06 '20 at 14:24
That's very helpful. I just checked the VIFs and all the generation levels are above 10. Is there any way I can solve this? Should I split my model into two parts, fit age-period and generation-period separately, and then use the one with the higher R2 to interpret the period and control variables? – Long Duong Aug 06 '20 at 14:52
@LongDuong you risk overfitting if you just use the model with higher $R^2$ based on this data sample, and $R^2$ isn't a good measure for a logistic model anyway. You're usually better off using a continuous (age) rather than a grouped (generation) predictor. It might be best to use `birth-year` instead of `age` and then include an interaction of `birth-year` with `period` to take into account generational differences. Modeling time-series data like this is always tricky, and I'm not very comfortable with it. You should consider getting some local statistical advice for this project. – EdM Aug 06 '20 at 15:11
