I have a dataset with a very large positive skew, which has been log-transformed. I wish to predict one variable from another using the lm function in R. Since both variables have been transformed, I am well aware that my regression will fit the equation:
ln(y) = b*ln(x) + a,
where a and b are the coefficients.
The model fit is good, with an R-squared of almost 0.6, and it produces a sensible range of predicted y values.
Now, I have back-transformed the predictions using the following equation:
y_predicted = exp(a)*x^b
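To make this concrete, here is a toy sketch of my workflow (in Python with simulated skewed data; the coefficients 0.5 and 0.8 and the noise level are made up for illustration, not taken from my actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for skewed data: y = exp(a) * x^b * lognormal noise
x = rng.lognormal(mean=2.0, sigma=1.0, size=500)
y = np.exp(0.5) * x**0.8 * rng.lognormal(mean=0.0, sigma=0.7, size=500)

# Fit ln(y) = b*ln(x) + a by ordinary least squares (what lm() does in R)
b, a = np.polyfit(np.log(x), np.log(y), deg=1)

# Back-transform the fitted values: y_predicted = exp(a) * x^b
y_predicted = np.exp(a) * x**b

print(f"sum(actual)    = {y.sum():.1f}")
print(f"sum(predicted) = {y_predicted.sum():.1f}")  # noticeably lower
```

Running this reproduces the pattern I describe below: the sum of the back-transformed predictions falls well short of the sum of the actual values.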
However, the predicted values for the larger x and y are significantly lower than they should be. Since I will be comparing the mean and the sum of all the y_predicted values against the y_actual values, this makes my model under-predict by around 75%.
Due to the logarithmic scale, a small deviation from the line of best fit in the log domain results in a very large deviation when back-transformed.
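To illustrate what I think is happening: if the residuals on the log scale are roughly mean-zero (I am simulating normal residuals here, which my real residuals may only approximate), then exponentiating systematically undershoots the mean of y, because exp() is convex:

```python
import numpy as np

rng = np.random.default_rng(1)

# Mean-zero residuals on the log scale (assumed normal for this sketch)
eps = rng.normal(loc=0.0, scale=0.7, size=100_000)

# exp() is convex, so by Jensen's inequality mean(exp(eps)) > exp(mean(eps)) = 1.
print(np.mean(np.exp(eps)))  # ~1.28, i.e. about exp(0.7**2 / 2), not 1
print(np.exp(np.mean(eps)))  # ~1
```

In other words, back-transforming the fitted log values seems to target something closer to the typical (median) y than the mean y, which would explain why my sums come out low.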
My question is how to adequately deal with this. I could come up with my own regression coefficients that deliberately over-predict some of these larger values, making the sum more aligned. However, that would defeat the point of using a linear model in the first place, which finds the optimal fit for me.
Also, I am not sure how statistically valid this would be: the method could not be replicated, since the coefficients would be determined by eye.
Thoughts welcome!