I have a dataset with a very large positive skew, which has been log-transformed. I wish to predict one variable from another using the lm function in R. Since both variables have been transformed, I am well aware that my regression will fit the equation:
ln(y) = b*ln(x) + a,
where a and b are the coefficients.
The model fit is good, with an R-squared of almost 0.6, and it produces a sensible range of predicted y values.
Now, I have back-transformed the predictions using the following equation:
y_predicted = exp(a)*x^b
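To make this concrete, here is a toy sketch of my workflow (in Python with simulated skewed data; the coefficients 0.5 and 0.8 and the noise level are made up for illustration, not taken from my actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for skewed data: y = exp(a) * x^b * lognormal noise
x = rng.lognormal(mean=2.0, sigma=1.0, size=500)
y = np.exp(0.5) * x**0.8 * rng.lognormal(mean=0.0, sigma=0.7, size=500)

# Fit ln(y) = b*ln(x) + a by ordinary least squares (what lm() does in R)
b, a = np.polyfit(np.log(x), np.log(y), deg=1)

# Back-transform the fitted values: y_predicted = exp(a) * x^b
y_predicted = np.exp(a) * x**b

print(f"sum(actual)    = {y.sum():.1f}")
print(f"sum(predicted) = {y_predicted.sum():.1f}")  # noticeably lower
```

Running this reproduces the pattern I describe below: the sum of the back-transformed predictions falls well short of the sum of the actual values.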
However, the predicted values for the larger x and y are significantly lower than they should be. Since I will be comparing the mean and the sum of all the y_predicted values against the y_actual values, this makes my model under-predict by around 75%.
Due to the logarithmic scale, a small deviation from the line of best fit in the log domain results in a very large deviation when back-transformed.
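To illustrate what I think is happening: if the residuals on the log scale are roughly mean-zero (I am simulating normal residuals here, which my real residuals may only approximate), then exponentiating systematically undershoots the mean of y, because exp() is convex:

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean-zero residuals on the log scale (assumed normal for this sketch)
eps = rng.normal(loc=0.0, scale=0.7, size=100_000)

# exp() is convex, so by Jensen's inequality mean(exp(eps)) > exp(mean(eps)) = 1.
print(np.mean(np.exp(eps)))  # ~1.28, i.e. about exp(0.7**2 / 2), not 1
print(np.exp(np.mean(eps)))  # ~1
```

In other words, back-transforming the fitted log values seems to target something closer to the typical (median) y than the mean y, which would explain why my sums come out low.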
My question is how to adequately deal with this. I could come up with my own regression coefficients that deliberately over-predict some of these larger values, making the sum more aligned. However, that would defeat the point of using a linear model in the first place, which finds the optimal fit for me.
Also, I am not sure how statistically valid this would be: the method could not be replicated, since the coefficients would be determined by eye.
Thoughts welcome!