
I am seeing some results that point to a clear conceptual gap in my understanding of multinomial logistic regression and am seeking an explanation.

I am performing multinomial logistic regression on a dataset in which the dependent variable has three levels. I first cross-tabulated the relationship between the dependent variable y (with levels y1, y2 and y3) and one categorical independent variable x (with levels x1 and x2). The cross-table of x and y showing row-wise percentages looks like this:

                  y
          ---------------------
          |   y1     y2     y3
  |---------------------------
  |  x1   |  47.6   28.4   23.9
x |  x2   |  26.1   21.4   52.5

From the above table, it is clear that the probability of y3 is much higher when x = x2 than when x = x1. Likewise, the odds of y3 relative to y1 are much higher when x = x2 than when x = x1.
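For concreteness, the unadjusted odds ratio of y3 versus y1 implied by the row percentages works out to roughly 4 (log odds ratio of about 1.4):

odds_x1 <- 23.9 / 47.6      # odds of y3 vs y1 when x = x1, ~0.50
odds_x2 <- 52.5 / 26.1      # odds of y3 vs y1 when x = x2, ~2.01
odds_x2 / odds_x1           # unadjusted odds ratio, ~4.0
log(odds_x2 / odds_x1)      # unadjusted log odds ratio, ~1.39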

I then ran a multinomial logistic regression with several independent variables in addition to x, with y = y1 and x = x1 as the reference values. I got the following coefficients for x from the regression:


====================================================
                   Dependent variable:
                 ----------------------------
                     y2          y3
----------------------------------------------------
x_x2               1.079***    -0.484***
----------------------------------------------------

From the coefficients, I can see that the log odds of y3 relative to y1 decrease by 0.484 when x changes from x1 to x2. This seems anomalous, given that the odds of y3 relative to y1 in the cross-table above are much higher for x = x2 than for x = x1. Of course, I understand that the regression includes many other independent variables and that these could have an impact, but I am unable to see how such an impact might come about.
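For reference, the coefficient corresponds to an adjusted odds ratio of y3 versus y1 of about 0.62, compared with the unadjusted odds ratio of roughly 4 computed from the table:

exp(-0.484)      # adjusted odds ratio of y3 vs y1 for x2 vs x1, ~0.62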

I would appreciate it if someone could shed more light on how the presence of other independent variables can cause something like this to happen.

Viswa V

1 Answer


This can happen when you condition on a mediator.

It is not specific to multinomial regression or to categorical data.

A simple simulation can demonstrate this:

set.seed(1)
N <- 100

X <- rnorm(N)          # exposure
M <- -2*X + rnorm(N)   # mediator: caused by X
Y <- X + M + rnorm(N)  # outcome: direct effect of X plus an effect through M

Now we fit a model, with X as the only predictor:

> summary(lm(Y ~ X))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.01032    0.13915  -0.074    0.941    
X           -0.97989    0.15456  -6.340 7.07e-09 ***

And now we also condition on M:

> summary(lm(Y ~ X + M))

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02535    0.10519   0.241  0.81005    
X            0.91418    0.24824   3.683  0.00038 ***
M            0.94653    0.10948   8.646 1.12e-13 ***

and now we find that the sign of the estimate for X has reversed, which is analogous to what you have found in your data. This happens because conditioning on the mediator blocks the indirect path X → M → Y, so the model estimates only the direct effect of X on Y (which is positive here). Without conditioning on M, the indirect path is open, so the model estimates the total effect of X on Y (direct plus mediated), which is negative because the strong negative indirect effect dominates. More detail can be found here: How do DAGs help to reduce bias in causal inference?
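To make the decomposition explicit with the coefficients used in the simulation: the indirect effect of X on Y through M is (-2) * 1 = -2, so the total effect is 1 + (-2) = -1, matching the estimate of about -0.98 from lm(Y ~ X), while the direct effect is +1, matching the estimate of about 0.91 from lm(Y ~ X + M).

b_direct   <- 1         # X -> Y path
b_indirect <- -2 * 1    # X -> M -> Y path
b_direct + b_indirect   # total effect = -1, recovered by lm(Y ~ X)
b_direct                # direct effect = +1, recovered by lm(Y ~ X + M)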

If you want to see this with multinomial Y and binary X, then you can easily adapt the above code to discretise the data.
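A minimal sketch of one way to do that, assuming the nnet package for multinom() (the cut points used for discretising are arbitrary, and the exact estimates will depend on them):

library(nnet)

Xb <- factor(ifelse(X > 0, "x2", "x1"))                         # binary version of X
Yc <- cut(Y, breaks = quantile(Y, c(0, 1/3, 2/3, 1)),
          labels = c("y1", "y2", "y3"), include.lowest = TRUE)  # 3-level version of Y

summary(multinom(Yc ~ Xb))        # unadjusted
summary(multinom(Yc ~ Xb + M))    # conditioning on the mediator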

Robert Long
  • I found your clear explanation to be very helpful. Thanks! – Viswa V Feb 18 '21 at 21:53
  • @ViswaV you are very welcome. Thanks for the feedback :) If this answers your question, please consider marking it as the accepted answer (just click the tick mark) – Robert Long Feb 18 '21 at 21:55