
I am trying to learn which transformations are better for my model, and I am comparing models that I have built. The first model is:

Call:
lm(formula = log(medv) ~ log(crim) + zn + log(indus) + chas + 
log(nox) + log(rm) + log(age) + log(dis) + log(rad) + log(tax) + 
log(ptratio) + log(black) + log(lstat), data = Boston)

Residuals:
Min       1Q   Median       3Q      Max 
-0.95001 -0.10118 -0.00198  0.10961  0.82680 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.3504375  0.4336744  12.337  < 2e-16 ***
log(crim)    -0.0314413  0.0111790  -2.813 0.005112 ** 
zn           -0.0011481  0.0005828  -1.970 0.049410 *  
log(indus)    0.0037637  0.0224508   0.168 0.866935    
chas          0.1011952  0.0362298   2.793 0.005423 ** 
log(nox)     -0.3659159  0.1074552  -3.405 0.000715 ***
log(rm)       0.3843709  0.1094673   3.511 0.000487 ***
log(age)      0.0410625  0.0223547   1.837 0.066833 .  
log(dis)     -0.1438053  0.0356083  -4.039 6.24e-05 ***
log(rad)      0.0949062  0.0220954   4.295 2.10e-05 ***
log(tax)     -0.1759806  0.0477668  -3.684 0.000255 ***
log(ptratio) -0.5895440  0.0912645  -6.460 2.52e-10 ***
log(black)    0.0532854  0.0126549   4.211 3.03e-05 ***
log(lstat)   -0.4186032  0.0258019 -16.224  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1988 on 492 degrees of freedom
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7636 
F-statistic: 126.5 on 13 and 492 DF,  p-value: < 2.2e-16

The second model is:

Call:
lm(formula = medv ~ log(crim) + zn + log(indus) + chas + log(nox) + 
log(rm) + log(age) + log(dis) + log(rad) + log(tax) + log(ptratio) + 
log(black) + log(lstat), data = Boston)

Residuals:
Min       1Q   Median       3Q      Max 
-13.3551  -2.5733  -0.2924   2.0704  22.8158 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.449e+01  9.307e+00   8.004 8.74e-15 ***
log(crim)     7.002e-02  2.399e-01   0.292 0.770524    
zn           -1.257e-04  1.251e-02  -0.010 0.991983    
log(indus)   -8.557e-01  4.818e-01  -1.776 0.076366 .  
chas          2.480e+00  7.775e-01   3.190 0.001514 ** 
log(nox)     -1.160e+01  2.306e+00  -5.030 6.90e-07 ***
log(rm)       1.374e+01  2.349e+00   5.850 8.98e-09 ***
log(age)      8.034e-01  4.798e-01   1.675 0.094658 .  
log(dis)     -6.327e+00  7.642e-01  -8.280 1.17e-15 ***
log(rad)      1.972e+00  4.742e-01   4.158 3.78e-05 ***
log(tax)     -4.277e+00  1.025e+00  -4.172 3.57e-05 ***
log(ptratio) -1.357e+01  1.959e+00  -6.927 1.35e-11 ***
log(black)    1.005e+00  2.716e-01   3.701 0.000239 ***
log(lstat)   -9.654e+00  5.537e-01 -17.433  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.266 on 492 degrees of freedom
Multiple R-squared:  0.7904,    Adjusted R-squared:  0.7849 
F-statistic: 142.7 on 13 and 492 DF,  p-value: < 2.2e-16

The only difference between the models is the log transformation of the dependent variable. When I compare them, I see that the residual standard error is much higher in the second model, but the R-squared is also higher in the second model. I do not understand which model is better. Is the large reduction in the residual standard error due to the log transformation of the dependent variable, or not?

Nick Cox
tyer
    On the `medv` scale the first model is multiplicative and the second one is additive. These models are not even similar. Since the residual standard errors are not on the same scale, you can't compare them. Also, due to the large number of predictors you are probably overfitting and should test for multicollinearity. – Roland Sep 17 '15 at 11:40
  • I'm curious where you got your data from and if it's freely available. – TrynnaDoStat Sep 17 '15 at 14:34
  • Yes, it is freely available. It is the Boston data set in the R MASS package. http://www.clemson.edu/economics/faculty/wilson/R-tutorial/analyzing_data.html @TrynnaDoStat – tyer Sep 17 '15 at 14:38
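
For reference, here is a minimal sketch of how the two fits above can be reproduced from the Boston data in the MASS package. The formulas are taken from the Call output; the object names fit_log and fit_raw are purely illustrative.

library(MASS)   # provides the Boston data set

# Model 1: log-transformed response
fit_log <- lm(log(medv) ~ log(crim) + zn + log(indus) + chas + log(nox) +
                log(rm) + log(age) + log(dis) + log(rad) + log(tax) +
                log(ptratio) + log(black) + log(lstat), data = Boston)

# Model 2: same predictors, response on its original scale
fit_raw <- update(fit_log, medv ~ .)

summary(fit_log)
summary(fit_raw)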

3 Answers


@Tim is right that the log transformation is what changes the residual standard error, and that comparing the two errors is meaningless. Why is this so? Consider a much simpler case: suppose the DV is income (in dollars) and a model predicts Joe's income to be \$100,000 when his real income is \$90,000. The error is \$10,000. Take logs (base 10) and, even if everything else stays the same, you get a predicted value of 5, an actual value of about 4.95, and an error of about 0.05 (this isn't exactly what is going on in your regressions, but I think it gives you a feel for why things change).
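
A quick sketch of that arithmetic in R (the income figures are the hypothetical ones from the paragraph above):

# error on the dollar scale
100000 - 90000                    # 10000

# the same prediction and actual value on the log10 scale
log10(100000)                     # 5
log10(90000)                      # approximately 4.954
log10(100000) - log10(90000)      # approximately 0.046, i.e. roughly 0.05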

Whether you should transform your DV should depend on substantive reasons more than statistical ones. You didn't say what MEDV and the other variables are, but it looks like MEDV is a median value, presumably the price of a house or something like that.

When the DV is a dollar amount, taking logs often makes sense because we often think of these amounts on a multiplicative scale. That is, the difference between a \$100,000 house and a \$200,000 house is huge. The difference between a \$1,000,000 house and a \$1,100,000 house is much smaller.
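
To make the multiplicative-scale point concrete, a tiny illustration with the house prices mentioned above:

# the two gaps are identical in dollars ...
200000 - 100000                      # 100000
1100000 - 1000000                    # 100000

# ... but very different on a log scale
log10(200000) - log10(100000)        # about 0.30 (a doubling in price)
log10(1100000) - log10(1000000)      # about 0.04 (a 10% increase)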

Peter Flom
  • Thank you. Yes, medv is the median value of owner-occupied homes in $1000's. I think taking the log of the dependent variable is still meaningful based on what you said, right? – tyer Sep 17 '15 at 12:33
  • Yes, I think so, although you might want to take the log of actual dollars, in base 10. One problem with a log transformation is that the interpretation isn't quite as obvious as with raw data, but, in my experience, log10(raw value) is easier than other methods. The conclusions shouldn't change. – Peter Flom Sep 17 '15 at 12:48

Yes, the log transformation of the dependent variable is what is reducing your standard error, so comparison of the standard errors of the residuals is meaningless.

Based on R-squared alone, the second model is better. But to make a sensible choice between the models you need to work out which of them has the more sensible interpretation in your area of application (e.g., the first model's coefficients are elasticities, which makes a lot of sense in economics and marketing, but may not in your field; and which model's coefficients have the more sensible signs and magnitudes?), and compare the various standard diagnostics (e.g., outliers, normality of residuals).
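
For concreteness, those standard diagnostics could be compared along the following lines (a rough sketch; fit_log and fit_raw are the two models as fitted in the sketch under the question's comments):

# residuals vs fitted, normal QQ plot, scale-location, leverage for each model
par(mfrow = c(2, 2))
plot(fit_log)

par(mfrow = c(2, 2))
plot(fit_raw)

# one possible formal check of residual normality
shapiro.test(residuals(fit_log))
shapiro.test(residuals(fit_raw))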

Tim
  • Hi, thank you for your answer. We want to do automatic model selection for linear regression, because there will be a program that analyzes the data and builds models automatically. Do you suggest any way of selecting the transformation for a regression model automatically, by computer? We will build many models for neighbourhoods in our country. @Tim – tyer Sep 17 '15 at 12:17
  • I strongly advise against automatic anything. – Peter Flom Sep 17 '15 at 12:49
  • A Box-Cox transformation is the most straightforward approach (see the sketch after these comments), but I tend to agree with @Peter Flom. – Tim Sep 17 '15 at 12:52
  • Can I ask why? I think that maybe the computer can try quadratic polynomial regression, linear regression, and linear regression with log transformation of the independent variables, and then choose the model with the lowest standard error. Does that make sense? I will also try to run the regression diagnostics automatically. @Peter Flom – tyer Sep 17 '15 at 13:04
  • Also, maybe I will add some interaction terms to the models. – tyer Sep 17 '15 at 13:12
  • @tyer What is your goal here? Inference, prediction or something else? – Roland Sep 17 '15 at 13:32
  • I am trying to predict home prices by building models based on neighbourhood. Every neighbourhood may need a different model, so I want to do model selection automatically, since there is a huge number of neighbourhoods. @Roland – tyer Sep 17 '15 at 13:35
  • @tyer I think you should educate yourself about machine learning approaches. In particular, you should focus on the topic of model validation. – Roland Sep 17 '15 at 14:01
  • "Based on r-square alone, the second model is better": but one is on a log scale and the other is not! *So we should not compare the $R^2$ values. They are incomparable.* See e.g. Gujarati's "Basic econometrics", which deals with this point in a section on "comparing two $R^2$ values" immediately after introducing $R^2$ for multiple regression. (He suggests if you have one model using $\log Y_i$ and another for $Y_i$, then obtain fitted $\hat {\log Y_i}$ from the 1st model, antilog them, then find the $R^2$ between the result and $Y_i$. This $R^2$ can then be compared to $R^2$ of 2nd model.) – Silverfish Sep 17 '15 at 14:45
  • I have studied machine learning approaches; I know that after building the model I should do cross-validation, right? @Roland – tyer Sep 18 '15 at 06:38
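
To illustrate the Box-Cox suggestion from the comments above, here is a minimal sketch using MASS::boxcox on the untransformed model (an estimated lambda near 0 supports a log transformation of medv, while a value near 1 supports leaving it untransformed):

library(MASS)

# profile log-likelihood over a range of lambda values
bc <- boxcox(fit_raw, lambda = seq(-1, 1, by = 0.05))

# lambda with the highest profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat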

I will try to help you; because I do not speak English very well, a short answer: your model with a log dependent variable cannot be compared directly. What you have to do is create a new variable, call it m_i, where m_i = exp(yhat_i) and yhat_i are the fitted values from the regression of log(y_i) on your X variables (the log model).

Then an objective way to compare the two models is a new estimator R2' for the log model that is comparable with the R2 of the ordinary linear model:

R2' = cor(y_i, m_i)^2, where y_i is the vector of the original dependent variable and cor is the correlation function. If R2' > R2 of the linear model, the log model is better.

For forecasting purposes, you cannot simply predict a value by applying the exp function to a predicted log value; you would systematically underestimate the expected value of y.

(Source: Wooldridge, Introductory Econometrics, 2012.)
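
A minimal sketch of this comparison (fit_log and fit_raw as in the earlier sketch; the smearing factor at the end is one common correction for the retransformation bias described above):

# fitted values of log(medv) from the log model, mapped back to the medv scale
m <- exp(fitted(fit_log))

# R2' for the log model, comparable with the R-squared of the linear model
r2_log_comparable <- cor(Boston$medv, m)^2
r2_raw <- summary(fit_raw)$r.squared
c(r2_log_comparable, r2_raw)   # the larger value indicates the better fit on the medv scale

# naive exp() back-transformation underestimates E[medv];
# a common fix is to rescale by the mean of exp(residuals) ("smearing")
smear <- mean(exp(residuals(fit_log)))
pred_medv <- smear * exp(fitted(fit_log))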

Pipo