
I have a dataset for predicting customer churn that contains categorical and numeric variables. I intend to perform a logistic regression. I want to apply a log transformation to some of the numeric predictors, adding a constant first so that all the values are at least 1.

My questions are:

Does this constant have to be the same for all the variables intended to transform?

Does this procedure have an impact on the coefficient estimates?

How should I interpret this coefficient of the transformed variable?

Any references would be of great help :)

  • Are you taking the log and then adding some constant, i.e. $c+\log x $ – gunes Jul 18 '21 at 11:35
  • No, my idea is to add a constant and then take the log(1+x). The minimum value of certain variables is -11, in others 0. So I would have to sum 12 or 1, respectively. Is this a valid procedure? – Gabriela Debska Jul 18 '21 at 22:17
  • Rescaling a variable is quite common, but rescaling all of the variables and then taking $\log$ sounds a bit odd. The logistic regression model is able to handle negative values. If all of the covariates come from (let's say) a questionnaire where each response is coded to a discrete value, maybe you should consider using a log-linear model rather than logistic regression. – Spätzle Jul 19 '21 at 06:57
  • See https://stats.stackexchange.com/a/30749/919. – whuber Jul 19 '21 at 12:52
  • Using $\log(1+X)$ is a bad idea because it makes the model dependent on the scale of the variables, e.g., whether they are measured in dollars or thousands of dollars. If you want your model to fit, why not use a flexible regression method like splines? – Noah Jul 19 '21 at 18:49
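The shift-then-log procedure discussed in the comments above can be sketched as follows. This is a minimal illustration with made-up data; the variables `x1` and `x2` are hypothetical, and the constants 12 and 1 follow the asker's example (minimums of $-11$ and $0$). It also demonstrates Noah's point that $\log(x+c)$ is not invariant to the scale of the variable:

```python
import numpy as np

# Hypothetical predictors: one with minimum -11, one with minimum 0
x1 = np.array([-11.0, -3.0, 0.0, 4.0, 9.0])
x2 = np.array([0.0, 1.0, 5.0, 20.0, 100.0])

# Shift each variable by its own constant so its minimum becomes 1,
# then take the log (the constant need not be the same across variables)
c1 = 1.0 - x1.min()   # 1 - (-11) = 12
c2 = 1.0 - x2.min()   # 1 - 0 = 1
z1 = np.log(x1 + c1)
z2 = np.log(x2 + c2)

# Caveat from the comments: log(x + c) depends on the scale of x.
# Re-expressing x2 in "thousands" and repeating the recipe gives a
# transformed variable that is NOT a linear rescaling of z2:
x2_thousands = x2 / 1000.0
z2_rescaled = np.log(x2_thousands + 1.0 - x2_thousands.min())
```

Because the transformed values change shape (not just units) under rescaling, the fitted coefficients and p-values will depend on the measurement units chosen, which is one reason the comments advise against this recipe.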

1 Answer


a. Why do you need a transformation at all, and why the $\log$ transformation specifically? Try to think whether there is another way.

b. No, this constant doesn't have to be the same for every variable, but add one transformation at a time and check at each step that you are actually improving the model.

c. Of course the coefficients will be affected. In logistic regression the fitted output is $\hat{\pi}_i=P(y_i=1|x_i)$, but don't forget that it is the sigmoid of the linear predictor $\hat{\theta}_i$:

$$\hat{\pi}_i=\frac{e^{\hat{\theta_i}}}{1+e^{\hat{\theta_i}}}=\frac{e^{\hat{\beta}^Tx_i}}{1+e^{\hat{\beta}^Tx_i}}$$

When you transform a variable, the corresponding coefficient is bound to change, since the maximum-likelihood fit still optimizes the likelihood with respect to $y_i$, now through the transformed predictor.
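To make point (c) concrete, here is a minimal sketch on simulated churn-style data (all names and parameters are made up for illustration). It fits the same logistic regression on a raw predictor and on its $\log(1+x)$ version; the two coefficients differ, and the log-scale coefficient is interpreted on the log-odds scale: a one-unit increase in $\log(1+x)$ multiplies the odds of churn by $e^{\hat{\beta}}$.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: a positive numeric predictor (e.g., monthly charges)
# whose true effect on churn acts on the log scale
n = 500
x = rng.exponential(scale=50.0, size=n)
logit = -2.0 + 1.2 * np.log1p(x)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Fit on the raw predictor and on log(1 + x); C is set large to make
# the default ridge penalty negligible (effectively unpenalized MLE)
raw = LogisticRegression(C=1e6).fit(x.reshape(-1, 1), y)
logged = LogisticRegression(C=1e6).fit(np.log1p(x).reshape(-1, 1), y)

b_raw = raw.coef_[0, 0]
b_log = logged.coef_[0, 0]
# Interpretation of b_log: increasing log(1 + x) by 1 (i.e., multiplying
# 1 + x by e) multiplies the estimated odds of y = 1 by exp(b_log).
```

As the answer notes, both fits optimize the same likelihood in $y_i$, so the coefficient simply adapts to whichever scale the predictor is supplied on.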

Spätzle
  • a. I want to achieve linearity between the predictors and the log odds. – Gabriela Debska Jul 18 '21 at 22:17
  • That's great, but as I wrote before, keep validating that your transformations don't introduce some kind of collinearity into the data. Also, consider the side effects on the intercept. – Spätzle Jul 19 '21 at 06:52