Modelling non-linearity for binary independent variables in logistic regression

Question

I have fit a logistic regression where the response variable is binary - whether an interview candidate got the position or not - and the independent variables are a combination of continuous, categorical, and binary variables. In order to test the assumption of linearity between log-odds and predictors, I carried out a Box-Tidwell test on all the continuous and binary variables, increasing each variable by 1 so that all variables in the Box-Tidwell test are positive.

The results indicated that several of the binary variables have a non-linear relationship with the log odds of the outcome. I want to include this non-linearity in the model - and I want to know what strategies are available to me to do so. So far, I think I can:

Take the up-shifted binary variables, i.e. where the original binary variable is $X_1 \in \{0,1\}$ and $X_1' = X_1 + 1$, so that $X_1' \in \{1,2\}$. Then, as with a continuous variable, I could include a polynomial term, regressing the log-odds of the outcome on $X_1' + X_1'^{2}$.
Apply the same up-shift to the binary variables, but then take the log, i.e. regress the log-odds of the outcome on $\log(X_1')$.

Are there any other strategies for modelling non-linear effects of binary independent variables? What are the advantages and disadvantages of these strategies?

It would help if you could show some sample data, your original model, and details of the test you ran. It's hard to see how a binary predictor can have a "non-linear" association with log-odds. Please do that by editing the question, as comments are easy to overlook and can get deleted. — EdM, Jan 07 '22 at 22:39

score 1 · Answer 1 · answered Jan 08 '22 at 22:43

You fundamentally can't have non-linearity for a binary predictor in a regression. With standard treatment coding, its reference level is subsumed in the intercept and its coefficient represents the difference in the linear-predictor value when it instead takes its non-reference level and all other predictors are held constant. That's just a single value. With a logistic regression, it's the difference in outcome log-odds between the two levels of the predictor.
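To see why both strategies proposed in the question collapse to the same single coefficient, note that a function evaluated at only two points can always be written as a straight line through those points. A short worked derivation, using the question's notation $X_1' = X_1 + 1 \in \{1, 2\}$ and an arbitrary transformation $f$ (a symbol introduced here purely for illustration):

$$f(X_1') = f(1) + \bigl(f(2) - f(1)\bigr)\,X_1$$

For the two transformations in the question this gives

$$X_1'^{2} = 1 + 3X_1 = 3X_1' - 2, \qquad \log(X_1') = \log(2)\,X_1$$

So the squared term is an exact linear combination of the intercept and $X_1'$, which makes the design matrix rank-deficient, and the log version only rescales the coefficient by $1/\log(2)$ while leaving the fitted probabilities unchanged.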

If you had a 3-level ordinal predictor and coded it as continuous you could have non-linearity, in that the effect of a change from the first, reference level to the third level might not be twice that of a change from the first to the second level. But there's no possibility of such non-linearity if there are only 2 levels to the predictor. There's just a single difference in linear-predictor values between its 2 levels.
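The contrast between two and three levels can be checked numerically. The sketch below is only an illustration: the helper names `line_through` and `worst_gap` are made up for this example, and the levels are assumed to be coded 1, 2 (an up-shifted binary) and 1, 2, 3 (an ordinal treated as continuous). It draws a straight line through the squared term at the first two levels and measures how far the squared term departs from that line.

```python
def line_through(f, a, b):
    """Intercept and slope of the straight line that matches f at x = a and x = b."""
    slope = (f(b) - f(a)) / (b - a)
    return f(a) - slope * a, slope


def square(x):
    return x * x


def worst_gap(f, levels):
    """Largest distance between f and the line through its first two levels."""
    intercept, slope = line_through(f, levels[0], levels[1])
    return max(abs(f(x) - (intercept + slope * x)) for x in levels)


# Hypothetical codings: an up-shifted binary predictor and a 3-level ordinal one.
gap_two_levels = worst_gap(square, (1, 2))
gap_three_levels = worst_gap(square, (1, 2, 3))

print("two levels:", gap_two_levels)
print("three levels:", gap_three_levels)
```

A gap of zero means the squared column carries no information beyond the intercept and the linear term, so its coefficient cannot be estimated. A non-zero gap, as with three levels, means a curvature term is identifiable.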

Without further information it's hard to know what's going on with your Box-Tidwell test. I'm not very familiar with it, and I'm not sure that it even is valid for binary predictors. Perhaps you are picking up a missing interaction term between the binary predictor and some continuous predictor?

The Box-Tidwell test isn't the best way to evaluate and model non-linearity in any event. It seems designed to pick up a non-linearity that might be fixed with a power transformation of the predictor. What's better is to use flexible modeling of continuous predictors, as with splines, to handle non-linearity directly. That allows the data to tell you the shape of the association between outcome and predictor, beyond a simple power transformation, and to evaluate the significance of any non-linearity that you do find. You also might benefit from incorporating interactions among predictors, interactions whose absence might be masquerading as non-linearities.
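As a rough sketch of what flexible modeling of a continuous predictor can look like, the example below fits two nested logistic models to simulated data and compares their deviances. Everything in it is an assumption made for illustration: the data are simulated rather than the asker's, the hand-rolled Newton-Raphson fit stands in for a statistics package, and a simple piecewise-linear basis stands in for the cubic or restricted cubic splines one would normally use.

```python
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Maximum-likelihood logistic regression via Newton-Raphson; returns (beta, deviance)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
        w = p * (1.0 - p)
        grad = X.T @ (y - p)
        hess = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    p = 1.0 / (1.0 + np.exp(-np.clip(X @ beta, -30, 30)))
    deviance = -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return beta, deviance


def linear_spline_basis(x, knots):
    """Columns x and max(x - k, 0) for each knot: a piecewise-linear spline."""
    return np.column_stack([x] + [np.maximum(x - k, 0.0) for k in knots])


# Simulated data (an assumption): curved effect of x, single shift for binary b.
rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(-2, 2, n)
b = rng.integers(0, 2, n)
true_log_odds = -1.0 + x**2 + 0.8 * b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_log_odds)))

ones = np.ones(n)
knots = np.quantile(x, [0.25, 0.5, 0.75])
X_linear = np.column_stack([ones, x, b])
X_spline = np.column_stack([ones, linear_spline_basis(x, knots), b])

_, dev_linear = fit_logistic(X_linear, y)
_, dev_spline = fit_logistic(X_spline, y)

# Likelihood-ratio statistic for the non-linear terms, and its degrees of freedom.
lr_stat = dev_linear - dev_spline
extra_df = X_spline.shape[1] - X_linear.shape[1]
print("LR statistic:", lr_stat, "on", extra_df, "df")
```

The likelihood-ratio statistic would be compared against a chi-square distribution with `extra_df` degrees of freedom to judge whether the non-linearity in `x` is significant. The binary predictor `b` enters both models as a single column, since there is no curvature to estimate for it.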
