I checked several related questions, but I think mine is much simpler. I understand it may be a naive question.
I fit Model 1, then removed the non-significant variables, and ended up with a worse model as measured by R-squared. Could someone explain why? With my current knowledge I cannot see any reason for this. The new adjusted R-squared is 0.9399.
My main guess is:
- Even though R-squared is slightly worse for the second model, the F-statistic is higher, and the F-statistic measures the overall significance of the model. So, in that sense, Model 2 is better.
My questions:
- Is Model 2 better than Model 1?
- Why is R-squared worse in Model 2?
- Is the higher F-statistic in Model 2 a guarantee of a better model?
Here are the two models:
Model 1: all variables
Call:
lm(formula = all ~ v1 + v2 + v3 + v4 + v5,
data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-5365.5 -1102.6   -85.9   868.9  7746.5 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.033e+04  4.219e+03  -2.449 0.016160 *  
v1           3.175e+02  6.104e+01   5.201 1.12e-06 ***
v2           5.903e+02  3.151e+02   1.873 0.064085 .  
v3           1.468e-01   4.083e-02   3.596 0.000512 ***
v4           9.864e-03   1.099e-02   0.898 0.371591    
v5           7.414e-02   1.120e-02   6.620 2.06e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2059 on 96 degrees of freedom
Multiple R-squared: 0.9468, Adjusted R-squared: 0.944
F-statistic: 341.4 on 5 and 96 DF, p-value: < 2.2e-16
Model 2: non-significant variables removed
Call:
lm(formula = all ~ v1 + v3 + v5, data = df)
Residuals:
    Min      1Q  Median      3Q     Max 
-4991.0 -1201.0  -166.8  1059.5  7281.3 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.115e+03  2.054e+03  -1.517    0.133    
v1           3.019e+02  6.222e+01   4.852 4.61e-06 ***
v3           2.119e-01   3.425e-02   6.188 1.42e-08 ***
v5           6.381e-02   1.102e-02   5.789 8.52e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2132 on 98 degrees of freedom
Multiple R-squared: 0.9417, Adjusted R-squared: 0.9399
F-statistic: 527.7 on 3 and 98 DF, p-value: < 2.2e-16
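For context on the R-squared part of my question: as I understand it, the multiple R-squared of a reduced least-squares model can never exceed that of the full model it is nested in, because dropping predictors only constrains the fit. Here is a minimal sketch (plain Python on synthetic data, not my actual `df`; the variables `x1`, `x2` are illustrative, with `x2` playing the role of a near-useless predictor like `v4`):

```python
# Sketch: R^2 of a reduced OLS model never exceeds R^2 of the full model,
# because the reduced model is the full model with a coefficient forced to 0.
import random

random.seed(0)
n = 100
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
# x2 is "non-significant": tiny true coefficient, swamped by noise
y = [3.0 * a + 0.05 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

def ols_r2(y, cols):
    """Fit y on an intercept plus the given predictor columns by solving
    the normal equations (Gauss-Jordan); return the multiple R^2."""
    X = [[1.0] + [c[i] for c in cols] for i in range(len(y))]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(len(y))) for b in range(p)]
           for a in range(p)]
    xty = [sum(X[i][a] * y[i] for i in range(len(y))) for a in range(p)]
    for k in range(p):                      # Gauss-Jordan elimination
        piv = xtx[k][k]
        for j in range(k, p):
            xtx[k][j] /= piv
        xty[k] /= piv
        for r in range(p):
            if r != k:
                f = xtx[r][k]
                for j in range(k, p):
                    xtx[r][j] -= f * xtx[k][j]
                xty[r] -= f * xty[k]
    beta = xty
    yhat = [sum(b * x for b, x in zip(beta, row)) for row in X]
    ybar = sum(y) / len(y)
    sse = sum((a - b) ** 2 for a, b in zip(y, yhat))
    sst = sum((a - ybar) ** 2 for a in y)
    return 1 - sse / sst

r2_full = ols_r2(y, [x1, x2])     # analogue of Model 1 (all predictors)
r2_reduced = ols_r2(y, [x1])      # analogue of Model 2 (x2 dropped)
print(r2_full >= r2_reduced)      # True: the full model's R^2 is never lower
```

So a small drop in multiple R-squared when dropping variables is expected by construction; adjusted R-squared is the number that penalizes for the extra parameters.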