
I am curious about the consequences of changing the order of the explanatory variables in a binary logistic regression. In a recent series of logistic regressions I ran in SPSS, I found that changing the order of the explanatory variables ($z, y, x$ instead of $x, y, z$) resulted in different coefficient values and significance levels. Investigating further, I got the same results in R using the same orders. Clearly, shifting the predictors around matters in terms of the results—my question is how? (And yes, I checked to make sure I wasn't entering the variables stepwise.)

Macro
darkfaculties

3 Answers


The order of the explanatory variables does not matter in logistic regression. The symptoms you describe, i.e. the unstable results, may hint at multicollinearity or quasi-complete separation in your data. Do you also have large standard errors? Since your independent variables are factors, you can tabulate them against each other and inspect the relationships between them. You should also tabulate the covariate patterns against your dependent variable. You could also start by fitting the model stepwise; then you will see at which point the problem arises.
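For instance, here is a small R sketch (the factor names, effect sizes, and sample size are invented for illustration) showing how cross-tabulation exposes an aliased factor and how `glm` reacts to it:

```r
set.seed(1)  # illustrative data only; nothing here comes from the original question
f1 <- factor(sample(c("low", "high"), 100, replace = TRUE))
f2 <- f1  # a perfectly collinear copy of f1
y  <- rbinom(100, 1, ifelse(f1 == "high", 0.8, 0.2))

table(f1, f2)  # zeros off the diagonal reveal the aliasing
table(f1, y)   # covariate patterns against the dependent variable

m <- glm(y ~ f1 + f2, family = binomial)
coef(m)  # the aliased f2 coefficient is NA ("not defined because of singularities")
```

In this extreme case R silently drops the redundant column; with near-but-not-perfect collinearity you would instead see huge standard errors.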

  • Even if there was multicollinearity, the coefficients wouldn't change based on what order you entered them into the formula. – Macro Sep 08 '11 at 11:15
  • OK, but in the presence of multicollinearity, the maximisation routine can have trouble converging, or may even fail to converge. These problems can translate into dicey parameter estimates. –  Sep 08 '11 at 14:03

The order in which you enter the variables (into a single model) does not matter. The logistic regression model of a binary response, $Y$, with three predictors $X_{1}, X_{2}, X_{3}$,

$$ \log \left( \frac{ P(Y=1) }{ P(Y=0) } \right) = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + \beta_{3} X_{3} $$

will give identical results to the "re-arranged" logistic regression model

$$ \log \left( \frac{ P(Y=1) }{ P(Y=0) } \right) = \alpha_{0} + \alpha_{1} X_{2} + \alpha_{2} X_{3} + \alpha_{3} X_{1} $$

with the coefficients re-arranged appropriately. Here is some R code that verifies this:

set.seed(1)  # for reproducibility
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
# latent-variable formulation of the logistic model
y <- -0.3 + 0.5 * x1 + 1.2 * x2 - 0.75 * x3 + rlogis(100)
y <- as.numeric(y > 0)
g1 <- glm(y ~ x1 + x2 + x3, family = binomial)
g2 <- glm(y ~ x2 + x3 + x1, family = binomial)
summary(g1)  # identical estimates and p-values for
summary(g2)  # x1, x2, x3 in both fits

The code generates data from the model

$$ \log \left( \frac{ P(Y=1) }{ P(Y=0) } \right) = -.3 + .5 X_{1} + 1.2 X_{2} - .75 X_{3} $$

The estimates for the coefficients of $X_{1}, X_{2}, X_{3}$ (and their significance) are the same regardless of the order in which you enter the variables. I'm not sure why you'd be finding results that disagree with this; a coding error may be the culprit.

Edit: Based on the comment discussion, it appears that perhaps you are using a categorical predictor and the variables you're including are dummy variables, indicating levels of the predictor. In that case, the order does matter because, in R, the first one entered will be chosen as the "reference" level of the variable and all comparisons are made with respect to it. So, if you change the reference level, you also change the thing the coefficients are being compared to, therefore the estimates and $p$-values can certainly change.
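This can be sketched in R (the factor levels and effect sizes below are invented for illustration):

```r
set.seed(42)  # illustrative data; not taken from the original question
f <- factor(sample(c("a", "b", "c"), 200, replace = TRUE))
y <- rbinom(200, 1, plogis(-0.5 + 1.0 * (f == "b") + 2.0 * (f == "c")))

m1 <- glm(y ~ f, family = binomial)                      # reference level "a"
m2 <- glm(y ~ relevel(f, ref = "c"), family = binomial)  # reference level "c"

coef(m1)  # contrasts against "a"
coef(m2)  # contrasts against "c": different estimates and p-values,
          # even though both parameterize the same fitted model
```

Both fits have identical deviance and fitted probabilities; only the parameterization, and hence the individual coefficients and their $p$-values, differs.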

Macro
  • Thanks for the response. Can you think of any reason why the following two equations might produce different results? `g1` … – darkfaculties Sep 08 '11 at 05:05
  • Another clue—one of the models kicks out one of the categorical variables with the message "Coefficients: (1 not defined because of singularities)," but the other model doesn't do this. I don't understand why the order of the terms would cause R to boot that variable in one model but not the other. – darkfaculties Sep 08 '11 at 05:18
  • Normally that error message indicates that some parameters are non-identified. For example, if level '1' of x1 always perfectly coincided with level '1' of x2, the two effects can not be separated from each other. It is mysterious why this would depend on the order in which you enter the variables. Are you sure there is no other code entered between the two calls to glm()? – Macro Sep 08 '11 at 11:21
  • Yeah, the order issue is perplexing. But the same thing happened in both R and SPSS, which makes me think it's more than just a simple bug. Same code both times; I promise. – darkfaculties Sep 08 '11 at 12:57
  • Are you sure that they are the same coefficients? When you include factors, one level from each is dropped. Maybe a different level is dropped in each ordering? – Charlie Sep 08 '11 at 14:35
  • @Charlie, yes, this is what's happening. Actually, in one ordering one level is dropped, but in the other ordering _zero_ levels are dropped, which accounts for the differences in results. (The algorithm also fails to converge in the latter.) What I can't figure out is, why would the order affect _whether_ a level was dropped as opposed to _which_ level? And given the difference, which results should I choose? I'm guessing the one with the level dropped, as that fixes the multicollinearity issue. – darkfaculties Sep 08 '11 at 14:44
  • You can't have 0 levels dropped or else you do have multicollinearity. If the algorithm isn't dropping a level to ensure that the design matrix (the matrix with all your predictor/x variables) is invertible, then that's a bit screwy. I'd be surprised if a canned product failed to drop a level to remove multicollinearity problems. That said, I would create dummy variables for each level of each factor, exclude one from each group on my own, and run the regression on the remaining dummies. This should be reliable in any order. – Charlie Sep 08 '11 at 23:29
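Charlie's manual-dummies suggestion might look like this in R (the factor, its levels, and the data are invented for illustration):

```r
set.seed(7)  # illustrative data only
f1 <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))
y  <- rbinom(100, 1, 0.5)

d <- model.matrix(~ f1 - 1)  # one indicator column per level
d <- d[, -1]                 # drop the "a" column yourself as the reference
m <- glm(y ~ d, family = binomial)
coef(m)  # intercept plus one coefficient per retained dummy, stable in any order
```

Because you choose which column to drop, the reference level no longer depends on how the software orders the terms.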

You could try the following in SPSS and see whether it sheds any light:

graph/scatter (matrix) outcome factor1 factor2 factor3.
rolando2