
I have a logistic regression with three independent variables. The correlation coefficients between the three variables are:

[image: correlation matrix of the three predictors]

For the two variables with a correlation coefficient of 0.1, a scatter plot shows there is definitely a relationship - let's call these variables X2 and X3.

[image: scatter plot of X2 vs. X3]

I built a logistic regression of the form:

Y = X1 + X2 + X3 + X2 * X3

X1 and X2 are significant; X3 and X2 * X3 are not.

However, if I transform X3 by taking the log, such that:

Y = X1 + X2 + log(X3) + X2 * log(X3)

X2 is no longer significant while log(X3) is. In other words, X1 and log(X3) are significant, but X2 and X2 * log(X3) are not.
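For context, here is roughly what the two fits look like with statsmodels' formula API. This is only a sketch on simulated data; the column names (Y, X1, X2, X3) and the generating process are illustrative, not my actual data.

```python
# Illustrative sketch only: simulated data standing in for the real X1, X2, X3, Y.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.normal(size=n),
    "X3": rng.lognormal(size=n),  # kept positive so log(X3) is defined
})
eta = 0.5 * df["X1"] + 0.8 * np.log(df["X3"])      # arbitrary "true" linear predictor
df["Y"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

# Model with raw X3 and the X2*X3 interaction
m1 = smf.logit("Y ~ X1 + X2 + X3 + X2:X3", data=df).fit(disp=0)

# Model with log(X3) and the X2*log(X3) interaction
m2 = smf.logit("Y ~ X1 + X2 + np.log(X3) + X2:np.log(X3)", data=df).fit(disp=0)

print(m1.summary())  # z-values / p-values for the raw-X3 specification
print(m2.summary())  # the same terms can change significance after the transform
```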

The only thing I've read is that if variables are highly correlated, the significance could change. But in this case, the variables do not seem highly correlated. Are there any other explanations for such a change in significance? The z-values (extracted from statsmodels in Python) are beyond 3.5 whenever a coefficient is significant, so it's not 'barely' significant. I've also checked the correlation coefficients after the transformation and they don't change much: -.25 drops to -.18 and .1 drops to .07.
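That check looked roughly like this (continuing the sketch above; logX3 is just an illustrative column name):

```python
# Compare correlations before and after the log transform (df from the sketch above).
df["logX3"] = np.log(df["X3"])
print(df[["X1", "X2", "X3"]].corr())      # correlations with raw X3
print(df[["X1", "X2", "logX3"]].corr())   # correlations after replacing X3 with log(X3)
```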

New correlation matrix after transforming X3 via log(X3): [image]

confused
  • The problem is with multicollinearity, not collinearity between individual predictors, when this is a problem (which it is not supposed to be very often). I don't think you can eliminate multicollinearity, if it exists, by this type of transformation; certainly I have never heard of that. Transformations do matter for normality, and that can impact the standard errors and thus the p-values. That is why normality is a problem. But this is usually not a major problem with enough data. How many data points do you have? – user54285 Feb 09 '21 at 00:43
  • Note that normality would be an assumption of the error term, not of the predictor variables or even the pooled distribution of the response variable. – Dave Feb 09 '21 at 01:04
  • @user54285 I don't think logistic regression makes any assumptions about the distribution of data. – confused Feb 09 '21 at 03:54
  • What code do you use to compute the 'significance'? (This might be of influence, e.g. think about the [order of variables](https://stats.stackexchange.com/a/213358).) Could you provide the code and the output? Also, can you provide the covariance table after the transformation? (It is unclear what you mean by -.25 drops to -.18 and .1 drops to .07.) – Sextus Empiricus Feb 09 '21 at 07:44
  • I updated the correlation matrix. Also, I just used the basic 'Logit' function from statsmodels in Python without doing anything else: https://www.statsmodels.org/stable/generated/statsmodels.discrete.discrete_model.Logit.html – confused Feb 10 '21 at 16:07

1 Answer


This is a matter of model form, not of collinearity, so the correlations between the variables will not help you interpret this phenomenon. The fact that you get significant results with one model but not with another just means that the predictors with significant coefficients are conditionally associated with the outcome, while the predictors in the other model may not be. There is no specific statistical reason why this happens. Transforming a variable is like fitting a totally different model, so there is no reason to expect that any predictor would function the same way in two different models.

The question of "which model is right for my data?" is an unanswerable question (otherwise we wouldn't need data). If you're in the business of model selection, you should use statistical techniques designed for that purpose. Trying a bunch of models to see which one fits best or yields significant results will invalidate any inferences made on that data.
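As a rough illustration (assuming fitted results m1 and m2 like those in the sketch under the question; those names are illustrative), non-nested specifications such as raw X3 versus log(X3) are often compared with information criteria rather than by which terms happen to be significant:

```python
# Sketch: compare the two non-nested specifications on information criteria
# (m1 = raw-X3 model, m2 = log-X3 model, fitted as in the question's sketch).
print("AIC, raw X3:  ", m1.aic)
print("AIC, log(X3): ", m2.aic)
print("BIC, raw X3:  ", m1.bic)
print("BIC, log(X3): ", m2.bic)
```

Even then, any inference you report from the specification you end up keeping is cleanest when it is validated on data that were not used to choose it.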

Noah
  • Interesting. So doing a monotonic transformation can lead to an entirely new variable or result - I would never have realized that. I always thought transformations were there to make your data fit assumptions, but logistic regression doesn't have as many assumptions as OLS linear regression. This could also explain why neural networks involve basically the same variables, just transformed over and over, and yet you can get better predictions with different transformations. – confused Feb 09 '21 at 03:58
  • A 1-unit increase in $\log(x)$ is actually multiplying $x$ by $e$ (almost 3), which is a very different thing from adding 1 unit to $x$. So the meaning of the estimated coefficient is totally different. You never need to transform your predictors to meet assumptions, as there are no assumptions on their distribution. Only outcomes have distributional assumptions in OLS. That intuition about neural networks is a nice connection. All ML methods essentially attempt to respecify a model by transforming continuous predictors. – Noah Feb 09 '21 at 04:35
  • @Noah, you do not think that the change of the $X2$ variable from significant to not significant might have to do with a correlation between $X2$ and $\log(X3)$? – Sextus Empiricus Feb 09 '21 at 07:38
  • Noah, I don't understand your comment that "You never need to transform your predictors to meet assumptions as there are no assumptions on their distribution." In fact, regression does make assumptions about the distribution of the residuals, for example normality, and there is a vast literature dealing with the need to transform variables specifically for this reason, none of which suggested this was fitting a different model. I am really surprised by your answer, which is very different from a lot of things I have read. – user54285 Feb 10 '21 at 23:30
  • I said there are no assumptions on the distributions of the **predictors**. The vast literature is on transforming the **outcome** to meet model assumptions. The area of feature engineering (which *is* about transforming the predictors) is about trying to make the best predictions using the predictors at hand, not about meeting assumptions of the model. – Noah Feb 11 '21 at 03:10