I am running a regression model, and I found that one independent variable is concentrated in a very narrow range.
Below are the summary statistics of the variables. The variable age has some outliers and is not normally distributed.
      size             age              dc
 Min.   : 69.30   Min.   : 2.20   Min.   : 0.30
 1st Qu.: 86.10   1st Qu.: 4.30   1st Qu.: 9.80
 Median :101.60   Median : 8.80   Median :13.90
 Mean   : 99.06   Mean   :11.94   Mean   :14.14
 3rd Qu.:111.00   3rd Qu.:16.85   3rd Qu.:18.00
 Max.   :134.80   Max.   :49.70   Max.   :29.00
       dt            price
 Min.   : 3.10   Min.   : 311.6
 1st Qu.:18.85   1st Qu.: 486.7
 Median :31.20   Median : 589.9
 Mean   :34.02   Mean   : 600.0
 3rd Qu.:49.30   3rd Qu.: 691.4
 Max.   :74.20   Max.   :1005.5
This is the model I am running.
fit <- lm(price ~ size + age + dc + dt, data=property)
Call:
lm(formula = price ~ size + age + dc + dt, data = property)
Residuals:
Min 1Q Median 3Q Max
-200.82 -79.53 5.38 63.98 244.10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 158.8981 111.9334 1.420 0.1597
size 5.9175 0.8531 6.937 1.04e-09 ***
age -2.3803 1.2460 -1.910 0.0598 .
dc 0.6866 2.2066 0.311 0.7565
dt -3.7139 0.6255 -5.938 7.58e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 103.5 on 78 degrees of freedom
Multiple R-squared: 0.553, Adjusted R-squared: 0.5301
F-statistic: 24.13 on 4 and 78 DF, p-value: 5.182e-13
Here, I can see that the model does not fit well, since the R-squared value is small (0.553). If I refit the model without the variable age, the fit is even worse.
Call:
lm(formula = price ~ size + dc + dt, data = property)
Residuals:
Min 1Q Median 3Q Max
-190.807 -87.045 6.956 59.852 264.729
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 113.6972 111.2233 1.022 0.310
size 5.9812 0.8666 6.902 1.15e-09 ***
dc 1.4765 2.2036 0.670 0.505
dt -3.7343 0.6358 -5.874 9.61e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 105.2 on 79 degrees of freedom
Multiple R-squared: 0.5321, Adjusted R-squared: 0.5143
F-statistic: 29.95 on 3 and 79 DF, p-value: 4.933e-13
The R-squared value is smaller than in the model with the variable age. I removed some outliers in age and ran the regression again, but there is still no big difference. I also tried transforming all the other variables and the dependent variable, but the adjusted R-squared improves only very slightly.
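To make the outlier and transformation attempts concrete, here is a minimal sketch of that kind of comparison. The data frame is simulated as a stand-in for `property` (the real data is not reproduced here), and the 1.5 * IQR cutoff and the `log(age)` transformation are assumptions chosen for illustration, not necessarily the exact steps I used.

```r
# Simulated stand-in for the `property` data (NOT the real data).
set.seed(1)
n <- 83
property <- data.frame(
  size = runif(n, 70, 135),
  age  = rexp(n, rate = 1 / 12) + 2,  # right-skewed, similar in shape to age
  dc   = runif(n, 0, 29),
  dt   = runif(n, 3, 74)
)
property$price <- 150 + 6 * property$size - 2 * property$age -
  3.7 * property$dt + rnorm(n, sd = 100)

# Flag high age values with the 1.5 * IQR rule (the cutoff is an assumption).
q <- quantile(property$age, c(0.25, 0.75))
upper <- q[2] + 1.5 * (q[2] - q[1])
trimmed <- property[property$age <= upper, ]

# Original model, model on the trimmed data, and model with log-transformed age.
fit_full    <- lm(price ~ size + age + dc + dt, data = property)
fit_trimmed <- lm(price ~ size + age + dc + dt, data = trimmed)
fit_logage  <- lm(price ~ size + log(age) + dc + dt, data = property)

# All three keep price on its original scale and include an intercept,
# so their adjusted R-squared values are at least roughly comparable.
sapply(list(full = fit_full, trimmed = fit_trimmed, log_age = fit_logage),
       function(m) summary(m)$adj.r.squared)
```

Note that the comparison would not be on the same footing if the response itself were transformed, for example to `log(price)`, because R-squared would then be measured on a different scale.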
All the variables are numeric, so I guess adding interaction terms would not help in this model. Should I convert the variable age to a categorical variable? When I plot age, almost all of the observations fall within a certain range. So I am thinking of coding a certain range of age as 1 and the rest as 0, treating it as a categorical variable, and checking whether I can detect an interaction with the other variables. I do not know which step I should start with in order to fit the best model.
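The dichotomizing idea could be sketched roughly as follows. The data is again simulated in place of `property`, the median split is a hypothetical cutoff, and `age_old` is an illustrative name. The last model is there because, as far as I can tell, R's formula syntax also accepts an interaction between two numeric variables directly, so converting age may not be required just to test for one.

```r
# Simulated stand-in for the `property` data (NOT the real data).
set.seed(2)
n <- 83
property <- data.frame(
  size = runif(n, 70, 135),
  age  = rexp(n, rate = 1 / 12) + 2,
  dc   = runif(n, 0, 29),
  dt   = runif(n, 3, 74)
)
property$price <- 150 + 6 * property$size - 2 * property$age -
  3.7 * property$dt + rnorm(n, sd = 100)

# Hypothetical cutoff: split age at its median into two groups.
cutoff <- median(property$age)
property$age_old <- factor(property$age > cutoff,
                           levels = c(FALSE, TRUE),
                           labels = c("newer", "older"))

# Main-effects model versus model with a size-by-age-group interaction.
fit_main <- lm(price ~ size + age_old + dc + dt, data = property)
fit_int  <- lm(price ~ size * age_old + dc + dt, data = property)

# Partial F-test: does the interaction term improve the fit?
anova(fit_main, fit_int)

# Numeric-by-numeric interaction, with age left continuous.
fit_num <- lm(price ~ size * age + dc + dt, data = property)
summary(fit_num)$coefficients
```

Splitting a continuous variable at a cutoff discards the information within each group, so the continuous version is probably worth checking first.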
I also found something interesting: the model without an intercept has a much higher adjusted R-squared value!
Call:
lm(formula = price ~ size + age + dc + dt - 1, data = property)
Residuals:
Min 1Q Median 3Q Max
-201.57 -78.66 14.37 69.96 258.56
Coefficients:
Estimate Std. Error t value Pr(>|t|)
size 7.0281 0.3423 20.531 < 2e-16 ***
age -2.0064 1.2257 -1.637 0.1056
dc 3.0443 1.4622 2.082 0.0406 *
dt -3.4364 0.5979 -5.747 1.63e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 104.1 on 79 degrees of freedom
Multiple R-squared: 0.973, Adjusted R-squared: 0.9717
F-statistic: 712.5 on 4 and 79 DF, p-value: < 2.2e-16
The R-squared value increased to 0.97! Should I remove the intercept? I am not sure whether there is any other way to improve this model. Can someone give me some tips? How should I deal with the skewed variable age?
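One check that seems worth doing before trusting the 0.97: the residual standard error barely changed (103.5 with the intercept, 104.1 without), which suggests the two fits are about equally good. As far as I understand, when the intercept is dropped, `summary.lm` measures R-squared against a baseline of zero instead of the mean of the response, so the two values are not directly comparable. The sketch below uses simulated data in place of `property` and recomputes both versions by hand.

```r
# Simulated stand-in for the `property` data (NOT the real data).
set.seed(3)
n <- 83
property <- data.frame(
  size = runif(n, 70, 135),
  age  = rexp(n, rate = 1 / 12) + 2,
  dc   = runif(n, 0, 29),
  dt   = runif(n, 3, 74)
)
property$price <- 150 + 6 * property$size - 2 * property$age -
  3.7 * property$dt + rnorm(n, sd = 100)

fit_int   <- lm(price ~ size + age + dc + dt, data = property)
fit_noint <- lm(price ~ size + age + dc + dt - 1, data = property)

y   <- property$price
rss <- sum(residuals(fit_noint)^2)

# Uncentered: total sum of squares around zero (used without an intercept).
r2_uncentered <- 1 - rss / sum(y^2)
# Centered: total sum of squares around the mean (used with an intercept).
r2_centered   <- 1 - rss / sum((y - mean(y))^2)

c(reported_noint = summary(fit_noint)$r.squared,
  uncentered     = r2_uncentered,
  centered       = r2_centered,
  with_intercept = summary(fit_int)$r.squared)
```

On the centered scale the no-intercept model cannot beat the model with an intercept, because the intercept model minimizes the residual sum of squares over a larger set of candidate fits.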