
Suppose I have a model output of a logistic regression: Y ~ 2.5X + 1.6Y - 3.5Z. My goal is to understand the impact of variable Y.

  1. All variables have p < .05.
  2. I see a big AIC change when I add the Y variable to the model.
  3. Intuitively, it makes sense from a business perspective that Y is important. So I feel Y matters to the model.

My goal is to understand the log-odds impact of Y (which I'll convert to odds and probability later). At first, I thought 1.6 was the impact. However, as I did more tests and introduced a variable W, my model output became:

Y ~ 2.8X + 3.5Y + 0.05Z + 1.5W.
Now I do a log-likelihood ratio test to compare the original model without W and the alternate model with W. Based on that test, I reject the null hypothesis and conclude that W should be included in the model. So now, going back to the impact of Y: is the "true" impact of Y actually 3.5 and not 1.6? Does a better overall model fit equate to more accurate coefficients for individual variables? If not, what is the best way to understand my coefficients and have better confidence in them?
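For concreteness, here is a minimal sketch of a likelihood-ratio test on simulated data. Everything below is a hypothetical illustration, not the asker's actual data: the names x, z, w, the coefficients, and the hand-rolled Newton-Raphson fitter are all assumptions made for the example.

```python
import math
import numpy as np

def fit_logit(X, y, iters=50):
    """Fit a logistic regression by Newton-Raphson; return (coefs, log-likelihood)."""
    X = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ b))
        H = X.T @ ((p * (1 - p))[:, None] * X)      # observed information
        b += np.linalg.solve(H, X.T @ (y - p))      # Newton step on the score
    p = 1 / (1 + np.exp(-X @ b))
    ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return b, ll

# Simulated data where w genuinely belongs in the model
rng = np.random.default_rng(0)
n = 5000
x, z, w = rng.normal(size=(3, n))
logit = 0.5 + 1.0 * x - 0.8 * z + 1.2 * w
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

_, ll_reduced = fit_logit(np.column_stack([x, z]), y)     # model without w
_, ll_full = fit_logit(np.column_stack([x, z, w]), y)     # model with w

lr_stat = 2 * (ll_full - ll_reduced)          # ~ chi-square with df = 1 under H0
p_value = math.erfc(math.sqrt(lr_stat / 2))   # chi-square survival function, df = 1
print(f"LR stat = {lr_stat:.1f}, p = {p_value:.3g}")
```

A tiny p-value here says only that the model with w fits the data significantly better than the nested model without it; it says nothing by itself about which model's coefficients are "truer."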

semidevil

1 Answer


It's important to get away from the idea that there is one "true" importance for a predictor.

As you note, the coefficient for the same predictor changes depending on what other predictors are in the model. That's true in any regression when an added or removed predictor is correlated with both the outcome and the included predictors. With logistic regression you can face this problem even when an omitted predictor isn't correlated with the included ones, a phenomenon known as non-collapsibility of the odds ratio.

"All models are wrong, but some are useful.". More complex models are typically better for prediction provided that you don't overfit the available data. But the cost of collecting the extra data either for the modeling or for future application might not be worth it. That's the business decision you might have to make, informed by statistical modeling.

EdM