
I checked several similar questions, but I think my question is very simple compared with the others. I understand that it is a very naive question.

I created Model 1, then eliminated the non-significant variables, and got a worse model as measured by R-squared. Could someone explain why? With my current knowledge I cannot see any reason for this. My new adjusted R-squared is 0.9399.

My main guess is:

  • Even though R-squared is slightly worse for the second model, the F-1 is better, and F-1 is a measure of a test's accuracy. So, in this case, Model 2 is better in terms of accuracy.

My questions:

  • Is Model 2 better than Model 1?
  • Why is R-squared worse in Model 2?
  • Is a better F-1 in Model 2 a guarantee of a better model?

These are the two models:

Model 1: All variables
Call:
lm(formula = all ~ v1 + v2 + v3 + v4 + v5, 
    data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-5365.5 -1102.6   -85.9   868.9  7746.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.033e+04  4.219e+03  -2.449 0.016160 *  
v1           3.175e+02  6.104e+01   5.201 1.12e-06 ***
v2           5.903e+02  3.151e+02   1.873 0.064085 .  
v3           1.468e-01  4.083e-02   3.596 0.000512 ***
v4           9.864e-03  1.099e-02   0.898 0.371591    
v5           7.414e-02  1.120e-02   6.620 2.06e-09 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2059 on 96 degrees of freedom
Multiple R-squared:  0.9468,    Adjusted R-squared:  0.944 
F-statistic: 341.4 on 5 and 96 DF,  p-value: < 2.2e-16

Model 2: Significant variables only

Call:
lm(formula = all ~ v1 + v3 + v5, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-4991.0 -1201.0  -166.8  1059.5  7281.3 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.115e+03  2.054e+03  -1.517    0.133    
v1           3.019e+02  6.222e+01   4.852 4.61e-06 ***
v3           2.119e-01  3.425e-02   6.188 1.42e-08 ***
v5           6.381e-02  1.102e-02   5.789 8.52e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2132 on 98 degrees of freedom
Multiple R-squared:  0.9417,    Adjusted R-squared:  0.9399 
F-statistic: 527.7 on 3 and 98 DF,  p-value: < 2.2e-16
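
Since Model 2 is nested inside Model 1, the two fits can also be compared formally with a partial F-test. A minimal sketch, assuming the fitted objects are saved as model1 and model2 (hypothetical names):

model1 <- lm(all ~ v1 + v2 + v3 + v4 + v5, data = df)
model2 <- lm(all ~ v1 + v3 + v5, data = df)

# Partial F-test: does dropping v2 and v4 significantly worsen the fit?
anova(model2, model1)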
FRM
  • I'm sorry, what do you mean by "F-1 is better" . Is this a classification problem, and you're using the F-1 score as your model evaluation metric? – Vladimir Belik Jan 25 '22 at 01:18
  • Yes, I don't have any other idea of how to justify the lower R-squared value for Model 2, except that since the R-squared values are very similar for Model 1 and Model 2, the difference could be considered irrelevant given that the F-1 is higher in Model 2. – FRM Jan 25 '22 at 01:24

1 Answer


Adding more variables will always increase the R-squared. Additionally, there is no reason to remove variables which fail to reject the null, as I explain here.
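
A minimal sketch with simulated data (all names made up for illustration) shows the effect: regressing on an extra pure-noise predictor cannot lower the multiple R-squared, though the adjusted R-squared may drop.

set.seed(1)

# y depends on x1 and x2; `noise` is unrelated to y by construction
n     <- 100
x1    <- rnorm(n)
x2    <- rnorm(n)
noise <- rnorm(n)
y     <- 2 * x1 + 3 * x2 + rnorm(n)

fit_small <- lm(y ~ x1 + x2)
fit_big   <- lm(y ~ x1 + x2 + noise)

# Multiple R-squared never decreases when a predictor is added...
summary(fit_small)$r.squared
summary(fit_big)$r.squared        # >= the line above

# ...but adjusted R-squared penalizes the extra degree of freedom
summary(fit_small)$adj.r.squared
summary(fit_big)$adj.r.squared    # typically lower than the line above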

Demetri Pananos
  • No reason to remove variables, depending on your objective (as you explain in the other link). Additionally, though, I believe adding more variables always increases the R-squared, yes. But not the adjusted R-squared, which is called "adjusted" precisely because it corrects for the inflation caused by adding variables. No? – Vladimir Belik Jan 25 '22 at 02:11
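
For reference, the adjustment the comment describes makes the penalty on the predictor count $p$ explicit:

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n-1}{n-p-1},$$

where $n$ is the sample size; an added predictor raises $R^2_{\text{adj}}$ only if the gain in $R^2$ outweighs the lost degree of freedom.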