
I want to transform a response variable that has both negative and positive values. When I looked into this, I saw that most people recommend adding a constant to the variable and then taking the logarithm of it. To illustrate, here is a simple scenario where this is the case.

import numpy as np

response_variable = np.array([1, -2, 3, -4, 5, -7, -8])
response_new = response_variable + 9  # shift so every value is positive
response_transformed = np.log(response_new)

I've seen this method used in multiple models. In fact, I used it in my least-squares regression model and the R-squared value did increase. However, what I can't understand is how the results can still be accurate and helpful if we've changed the actual response variable.
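
One way to see why the transformed fit can still be useful is that the shift-and-log transform is invertible: predictions made on the transformed scale can be mapped back to the original scale by exponentiating and subtracting the constant. Below is a minimal sketch of that round trip; the predictor values in x are made up purely for illustration and are not part of the original data.

import numpy as np

# Hypothetical predictor paired with the response values from the question
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5])
response_variable = np.array([1, -2, 3, -4, 5, -7, -8])

shift = 9
response_transformed = np.log(response_variable + shift)

# Ordinary least-squares line on the transformed scale
slope, intercept = np.polyfit(x, response_transformed, 1)
pred_transformed = slope * x + intercept

# Invert the transform to get predictions back on the original scale
pred_original = np.exp(pred_transformed) - shift
print(pred_original)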

  • Log transformation is useful when the target response variable is very heavily skewed. For example, say your data are actually samples from $y = ae^{-bx}$. You don't know this and try to model it with linear models, but linear models can never fit this data. However, if you take the log of the target response variable, you get $\log y = -bx + \log a$. This is linear, so you can fit it with linear models. – TenaliRaman Jun 16 '21 at 09:12
  • If that's the case, we can use it to deal with heteroscedasticity and it wouldn't cause inaccuracy right? – Ahmet Atilla Colak Jun 16 '21 at 09:21
  • Remember that, when we make an assumption about normality in linear regression (we don't always have to), we make it about the error term (estimated by the residuals), not about the pooled distribution of the response variable ($y$), and certainly not about any predictor variables. – Dave Jun 16 '21 at 09:57
  • @AhmetAtillaColak Most of the time, but not always: https://stats.stackexchange.com/questions/336315/will-log-transformation-always-mitigate-heteroskedasticity – TenaliRaman Jun 16 '21 at 09:59
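
To illustrate TenaliRaman's first comment above, here is a small sketch of the log-linearization idea; the parameter values and noise level are made up for illustration. Data generated from $y = ae^{-bx}$ are not linear in $x$, but $\log y = -bx + \log a$ is, so an ordinary linear fit on the log scale recovers $a$ and $b$.

import numpy as np

rng = np.random.default_rng(0)

# Simulated samples from y = a * exp(-b * x) with mild multiplicative noise
a_true, b_true = 3.0, 0.7
x = np.linspace(0.1, 5.0, 50)
y = a_true * np.exp(-b_true * x) * np.exp(rng.normal(0.0, 0.05, x.size))

# On the log scale the model is linear: log y = -b * x + log a
slope, intercept = np.polyfit(x, np.log(y), 1)
b_hat = -slope
a_hat = np.exp(intercept)
print(a_hat, b_hat)  # should be close to 3.0 and 0.7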

0 Answers