tail -c +43 uYayd.gif > TROW.tsv
tail -c +43 bAEMc.gif > AABB.tsv

Using the two files above, I can fit linear models.
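For completeness, a minimal sketch of loading the recovered files in R; read.delim assumes tab-separated data with a header row, so its arguments may need adjusting if the actual files differ:

# Hypothetical loading step: column names must match those used in the
# formulas below (futrdiff, ema21diff, ema89diff).
TROW <- read.delim("TROW.tsv")
AABB <- read.delim("AABB.tsv")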

The following output seems to indicate that, for TROW, either ema21diff or ema89diff on its own fits futrdiff quite well.

R> summary(lm(futrdiff ~ ema21diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema21diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max
-6.9238 -1.4405  0.0598  1.8670  8.0834

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  1.32199    0.38956   3.394 0.000899 ***
ema21diff   -0.66179    0.08244  -8.027 3.77e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.479 on 139 degrees of freedom
Multiple R-squared:  0.3167,    Adjusted R-squared:  0.3118
F-statistic: 64.44 on 1 and 139 DF,  p-value: 3.774e-13

R> summary(lm(futrdiff ~ ema89diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema89diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max
-5.5066 -1.7942 -0.0663  1.6676  7.6233

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.72792    0.93537   7.193 3.58e-11 ***
ema89diff   -0.52376    0.05945  -8.811 4.52e-15 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.402 on 139 degrees of freedom
Multiple R-squared:  0.3583,    Adjusted R-squared:  0.3537
F-statistic: 77.63 on 1 and 139 DF,  p-value: 4.515e-15

R> summary(lm(futrdiff ~ ema21diff + ema89diff, data=TROW))

Call:
lm(formula = futrdiff ~ ema21diff + ema89diff, data = TROW)

Residuals:
    Min      1Q  Median      3Q     Max
-5.7963 -1.7125  0.0304  1.7103  7.6391

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.4699     1.3091   4.178 5.18e-05 ***
ema21diff    -0.2148     0.1569  -1.369   0.1732
ema89diff    -0.3861     0.1167  -3.308   0.0012 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.395 on 138 degrees of freedom
Multiple R-squared:  0.3669,    Adjusted R-squared:  0.3578
F-statistic: 39.99 on 2 and 138 DF,  p-value: 1.993e-14

The following output seems to indicate that, for AABB, only ema89diff matters and ema21diff does not.

R> summary(lm(futrdiff ~ ema21diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema21diff, data = AABB)

Residuals:
    Min      1Q  Median      3Q     Max
-6.6453 -1.0660  0.1424  1.5878  3.7737

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.82788    0.18510  -4.473 1.59e-05 ***
ema21diff   -0.29036    0.08208  -3.537  0.00055 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.021 on 139 degrees of freedom
Multiple R-squared:  0.08258,   Adjusted R-squared:  0.07598
F-statistic: 12.51 on 1 and 139 DF,  p-value: 0.00055

R> summary(lm(futrdiff ~ ema89diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema89diff, data = AABB)

Residuals:
    Min      1Q  Median      3Q     Max
-5.6130 -1.0894  0.1935  1.4290  4.4952

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.32680    0.32840   0.995    0.321
ema89diff   -0.29094    0.05865  -4.961 2.02e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.945 on 139 degrees of freedom
Multiple R-squared:  0.1504,    Adjusted R-squared:  0.1443
F-statistic: 24.61 on 1 and 139 DF,  p-value: 2.018e-06

R> summary(lm(futrdiff ~ ema21diff+ema89diff, data=AABB))

Call:
lm(formula = futrdiff ~ ema21diff + ema89diff, data = AABB)

Residuals:
   Min     1Q Median     3Q    Max
-5.578 -1.140  0.206  1.361  4.593

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.6154     0.4542   1.355 0.177682
ema21diff     0.1345     0.1462   0.920 0.359045
ema89diff    -0.3750     0.1086  -3.454 0.000733 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.946 on 138 degrees of freedom
Multiple R-squared:  0.1556,    Adjusted R-squared:  0.1434
F-statistic: 12.71 on 2 and 138 DF,  p-value: 8.547e-06

It is easy to examine a handful of models by hand like this, but it does not scale. Could anybody show me an automated and commonly used way to find the best linear model (or a set of nearly equivalent best models) for a fit?

– user1424739

  • Perhaps you want the [step function](https://stat.ethz.ch/R-manual/R-devel/library/stats/html/step.html); a minimal sketch follows these comments. Or this question might help: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection – Sam Rogers Aug 03 '21 at 06:33
  • What is your goal? Is it about prediction, or about something else? Looking at multiple candidate models and then doing something with a single selected model afterwards has issues (more or less serious depending on what you are trying to do and how you do it; e.g. stepwise regression is known to be particularly bad). Do you need to decide on a single model, or could averaging over the candidate models with data-determined weights be an option? – Björn Aug 03 '21 at 08:33
  • Automatic model selection = analyst turning over thinking tasks to the computer. The rumor that "unimportant" variables should be dropped from models should have been squashed in the 1960s. – Frank Harrell Aug 03 '21 at 11:47
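
Building on the first comment, a minimal sketch of stats::step(), which automates the search using an AIC criterion (assuming TROW is already loaded; note that it inherits the stepwise-selection caveats raised in the other comments):

# Start from the full model and let step() add/drop terms by AIC.
full <- lm(futrdiff ~ ema21diff + ema89diff, data = TROW)
step(full, direction = "both")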

1 Answer

What you have done by hand is automatically done by the R function anova when given a single fitted model:

> anova(lm(mpg ~ disp + wt + cyl, data=mtcars))
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)    
disp       1 808.89  808.89 120.158 1.221e-11 ***
wt         1  70.48   70.48  10.469  0.003111 ** 
cyl        1  58.19   58.19   8.644  0.006512 ** 
Residuals 28 188.49    6.73                      
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Variables are added in the order in which they are listed in the model formula.
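
Since these are sequential (Type I) sums of squares, the table depends on that order; for example, reversing the terms in the same model generally yields different entries:

# Same model, terms reversed: the sequential sums of squares change.
anova(lm(mpg ~ cyl + wt + disp, data = mtcars))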

Beware, however, of the comment by @Björn on your question: this is an example of a stepwise method. What is even more problematic, the evaluation criterion is internal (the data used for training the model is reused for evaluation), which can overfit peculiarities of your training data. If you care about prediction accuracy, you might instead consider a selection criterion based on cross-validation, e.g. the leave-one-out mean squared prediction error (a minimal sketch follows).
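
For an lm fit, the leave-one-out error needs no refitting loop: the i-th leave-one-out residual equals the ordinary residual divided by one minus the i-th hat value (the PRESS identity). A minimal sketch, assuming the TROW data frame from the question is loaded (lower is better):

# Leave-one-out mean squared prediction error for a fitted lm.
loocv_mse <- function(fit) {
  mean((residuals(fit) / (1 - hatvalues(fit)))^2)
}

# Compare the three candidate models on out-of-sample error.
loocv_mse(lm(futrdiff ~ ema21diff, data = TROW))
loocv_mse(lm(futrdiff ~ ema89diff, data = TROW))
loocv_mse(lm(futrdiff ~ ema21diff + ema89diff, data = TROW))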

– cdalitz