I have fitted a logistic regression in which the response is binary (whether an interview candidate got the position or not) and the independent variables are a mix of continuous, categorical, and binary variables. To test the assumption of linearity between the log-odds and the predictors, I ran a Box-Tidwell test on all the continuous and binary variables, first adding 1 to each variable so that every value entering the test is strictly positive.

The results indicated that several of the binary variables have a non-linear relationship with the log odds of the outcome. I want to include this non-linearity in the model - and I want to know what strategies are available to me to do so. So far, I think I can:

  • Take the up-shifted binary variables, e.g. where the original binary variable $X_1 \in \{0,1\}$ and $X_1' = X_1 + 1$, then $X_1' \in \{1,2\}$. Then, as with a continuous variable, I could include a polynomial term - so I regress the log-odds of the outcome variable on $X_1' + X_1'^{2}$.
  • Apply the same up-shift to the binary variables, but then take the log, i.e. regress $Y$ on $\log(X_1')$.

Are there any other strategies for modelling non-linear effects of binary independent variables? What are the advantages and disadvantages of these strategies?

greggs
  • It would help if you could show some sample data, your original model, and details of the test you ran. It's hard to see how a binary predictor can have a "non-linear" association with log-odds. Please do that by editing the question, as comments are easy to overlook and can get deleted. – EdM Jan 07 '22 at 22:39

1 Answer

You fundamentally can't have non-linearity for a binary predictor in a regression. With standard treatment coding, its reference level is subsumed in the intercept and its coefficient represents the difference in the linear-predictor value when it instead takes its non-reference level and all other predictors are held constant. That's just a single value. With a logistic regression, it's the difference in outcome log-odds between the two levels of the predictor.
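To make that concrete for the two transformations proposed in the question (a short derivation, using the question's notation $X_1 \in \{0,1\}$ and $X_1' = X_1 + 1$): any function of a two-valued variable is exactly an affine function of that variable, so a transformation can only rescale the coefficient or duplicate information already in the model.

$$f(X_1) = f(0) + \bigl(f(1) - f(0)\bigr)\,X_1, \qquad X_1 \in \{0,1\}$$

$$\log(X_1') = (\log 2)\,X_1, \qquad X_1'^{2} = 1 + 3X_1 = 3X_1' - 2$$

So regressing on $\log(X_1')$ gives the same fitted log-odds as regressing on $X_1$, with the coefficient divided by $\log 2$. In the polynomial version, $X_1'^{2}$ is an exact linear combination of the intercept and $X_1'$, so its coefficient cannot be estimated separately.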

If you had a 3-level ordinal predictor and coded it as continuous you could have non-linearity, in that the effect of a change from the first, reference level to the third level might not be twice that of a change from the first to the second level. But there's no possibility of such non-linearity if there are only 2 levels to the predictor. There's just a single difference in linear-predictor values between its 2 levels.
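The contrast between 2 and 3 levels can be checked numerically. The following is a minimal sketch in plain Python, assuming a model with only an intercept and the one predictor, so that the fitted log-odds at each level equal the empirical log-odds. All counts are made up for illustration and do not come from the question.

```python
import math

def log_odds(successes, total):
    """Empirical log-odds of success within one group."""
    p = successes / total
    return math.log(p / (1 - p))

# Hypothetical (hired, interviewed) counts per level of a binary predictor.
binary_counts = {0: (20, 100), 1: (45, 100)}
lo = {level: log_odds(*c) for level, c in binary_counts.items()}

# Fit on X in {0, 1}: intercept = log-odds at 0, slope = the single difference.
slope_x = lo[1] - lo[0]

# Fit on log(X + 1) in {0, log 2}: the slope is rescaled by 1 / log(2),
# but the fitted log-odds at each level are unchanged.
slope_log = (lo[1] - lo[0]) / math.log(2)
fitted_x = [lo[0] + slope_x * x for x in (0, 1)]
fitted_log = [lo[0] + slope_log * math.log(x + 1) for x in (0, 1)]
print(fitted_x, fitted_log)

# Hypothetical counts for a 3-level ordinal predictor coded 0, 1, 2.
ordinal_counts = {0: (20, 100), 1: (30, 100), 2: (70, 100)}
lo3 = {level: log_odds(*c) for level, c in ordinal_counts.items()}
step_01 = lo3[1] - lo3[0]
step_02 = lo3[2] - lo3[0]

# Linearity in the coded value would require step_02 == 2 * step_01.
departure = step_02 - 2 * step_01
print(step_01, step_02, departure)
```

With two levels there is nothing for `departure` to measure; with three levels it is the quantity a non-linearity test would be picking up.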

Without further information it's hard to know what's going on with your Box-Tidwell test. I'm not very familiar with it, and I'm not sure that it is even valid for binary predictors. Perhaps you are picking up a missing interaction term between the binary predictor and some continuous predictor?
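One observation that may help, assuming the test was run in its common form of adding a term $X \log X$ for each tested predictor and checking that term's coefficient (the question does not say how it was run): for the shifted binary variable $X_1' \in \{1,2\}$ the added term takes the values $0$ and $2\log 2$, so it is an exact linear function of $X_1'$.

$$X_1' \log X_1' = (2\log 2)\,(X_1' - 1), \qquad X_1' \in \{1,2\}$$

If so, the added term is perfectly collinear with the intercept and $X_1'$, and a result for it cannot be a separate estimate of curvature. What the software reported for the binary variables would then depend on how it handled that collinearity, or on something else in the model.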

The Box-Tidwell test isn't the best way to evaluate and model non-linearity in any event. It seems designed to pick up a non-linearity that might be fixed with a power transformation of the predictor. What's better is to use flexible modeling of continuous predictors, as with splines, to handle non-linearity directly. That allows the data to tell you the shape of the association between outcome and predictor, beyond a simple power transformation, and to evaluate the significance of any non-linearity that you do find. You also might benefit from incorporating interactions among predictors, interactions whose absence might be masquerading as non-linearities.
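As a rough illustration of the spline approach, here is a minimal sketch on simulated data using only NumPy. Everything in it is an assumption made for the example: the data-generating model, the choice of a piecewise-linear spline with two knots at the terciles of `x`, and the hand-written Newton-Raphson fitter `fit_logit`. In practice you would more likely use a restricted cubic spline from a regression package. The sketch compares a model that is linear in `x` against the spline model with a likelihood-ratio test.

```python
import math
import numpy as np

rng = np.random.default_rng(0)

# Simulated data, for illustration only: a continuous predictor x with a
# U-shaped effect on the log-odds, plus a binary predictor b.
n = 2000
x = rng.uniform(0.0, 10.0, n)
b = rng.integers(0, 2, n)
true_logit = -2.0 + 0.15 * (x - 5.0) ** 2 + 0.8 * b
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_logit)))

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Logistic regression by Newton-Raphson; returns (beta, log-likelihood)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)                       # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # observed information
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    eta = X @ beta
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    return beta, loglik

ones = np.ones(n)

# Model 1: x enters linearly.
X_lin = np.column_stack([ones, x, b])

# Model 2: piecewise-linear spline in x (hinge terms at two knots).
# The binary predictor b still enters as a single column.
knots = np.quantile(x, [1 / 3, 2 / 3])
hinges = [np.maximum(x - k, 0.0) for k in knots]
X_spl = np.column_stack([ones, x, *hinges, b])

_, ll_lin = fit_logit(X_lin, y)
_, ll_spl = fit_logit(X_spl, y)

# Likelihood-ratio test of the two extra hinge terms (2 degrees of freedom).
# For a chi-square with 2 df the survival function is exactly exp(-s / 2).
lr_stat = 2.0 * (ll_spl - ll_lin)
p_value = math.exp(-lr_stat / 2.0)
print(f"LR statistic = {lr_stat:.1f}, p = {p_value:.3g}")
```

The linear model is nested in the spline model (set both hinge coefficients to zero), which is what makes the likelihood-ratio comparison meaningful. An interaction could be examined the same way, by adding a column such as `b * x` and comparing log-likelihoods.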

EdM