
I am trying to get my head around the summaries of linear models in R; in other words, I am trying to identify when the summary of a model is good or bad. Consider the following two examples: both are fitted to the same data, except that the latter has been simplified using step(). Why is the latter better than the first?
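
For reference, here is a minimal sketch of how the two fits could be reproduced (assuming the clean.algae data frame is already loaded; the response a1 and the data frame come from the Call shown in Example two, and the full-model formula is my assumption):

# Sketch only: assumes clean.algae (with response a1 among columns 1:12) is available
full.lm  <- lm(a1 ~ ., data = clean.algae[, 1:12])  # Example one: all predictors
small.lm <- step(full.lm)                           # Example two: stepwise simplification by AIC
summary(full.lm)
summary(small.lm)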

Example one:

Residuals:
    Min      1Q  Median      3Q     Max 
-37.679 -11.893  -2.567   7.410  62.190 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  42.942055  24.010879   1.788  0.07537 . 
seasonspring  3.726978   4.137741   0.901  0.36892   
seasonsummer  0.747597   4.020711   0.186  0.85270   
seasonwinter  3.692955   3.865391   0.955  0.34065   
sizemedium    3.263728   3.802051   0.858  0.39179   
sizesmall     9.682140   4.179971   2.316  0.02166 * 
speedlow      3.922084   4.706315   0.833  0.40573   
speedmedium   0.246764   3.241874   0.076  0.93941   
mxPH         -3.589118   2.703528  -1.328  0.18598   
mnO2          1.052636   0.705018   1.493  0.13715   
Cl           -0.040172   0.033661  -1.193  0.23426   
NO3          -1.511235   0.551339  -2.741  0.00674 **
NH4           0.001634   0.001003   1.628  0.10516   
oPO4         -0.005435   0.039884  -0.136  0.89177   
PO4          -0.052241   0.030755  -1.699  0.09109 . 
Chla         -0.088022   0.079998  -1.100  0.27265   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.65 on 182 degrees of freedom
Multiple R-squared:  0.3731,    Adjusted R-squared:  0.3215 
F-statistic: 7.223 on 15 and 182 DF,  p-value: 2.444e-12

Example two:

Call:
lm(formula = a1 ~ size + mxPH + Cl + NO3 + PO4, data = clean.algae[, 
    1:12])

Residuals:
    Min      1Q  Median      3Q     Max 
-28.874 -12.732  -3.741   8.424  62.926 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.28555   20.96132   2.733  0.00687 ** 
sizemedium   2.80050    3.40190   0.823  0.41141    
sizesmall   10.40636    3.82243   2.722  0.00708 ** 
mxPH        -3.97076    2.48204  -1.600  0.11130    
Cl          -0.05227    0.03165  -1.651  0.10028    
NO3         -0.89529    0.35148  -2.547  0.01165 *  
PO4         -0.05911    0.01117  -5.291 3.32e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.5 on 191 degrees of freedom
Multiple R-squared:  0.3527,    Adjusted R-squared:  0.3324 
F-statistic: 17.35 on 6 and 191 DF,  p-value: 5.554e-16
  • These are not data summaries, they are model summaries; this is not nonlinear regression, it is linear regression – Peter Flom May 19 '13 at 13:29
  • correction made – godzilla May 19 '13 at 13:31
  • @godzilla What do you mean by "good" and "bad"? What makes you think that the second model is "better"? By the way: It is generally [not recommended to do an automatic model selection](http://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection). – COOLSerdash May 19 '13 at 13:34
  • Granted, I got this example from a book; according to the book the latter will be better for prediction (although it admits not a great deal better). I am trying to understand why this is the case – godzilla May 19 '13 at 13:36
  • You didn't show residuals from the first model, but in the second one, they look non-normal from the 5 number summary. – Peter Flom May 19 '13 at 13:37
  • I see very little that suggests the second is 'better for prediction' – Glen_b May 19 '13 at 13:38
  • @Glen, can you expand on this please? – godzilla May 19 '13 at 13:40
  • OK, updated and added residuals to the first – godzilla May 19 '13 at 13:41
  • @godzilla for it to suggest that the second is better for prediction, there'd need to be some output there that suggested it. I didn't see any that clearly did so. Is there some particular thing that you think *does* suggest it? – Glen_b May 19 '13 at 13:55
  • @godzilla The reference to "a book" does not help. But the second model is on most overall crude figures of merit about as good as the first while being much simpler (6 predictors rather than 15 predictors), so if this were the only choice in the world, the second model is likely to perform better in out-of-sample prediction. Overfitting -- matching quirks in the data -- is more likely with the first. But neither model seems very good. Data on algae (numbers? concentrations? you tell us) seem unlikely to be well described by a hyperplane. I'd expect a log link function to work better. – Nick Cox May 19 '13 at 14:41

0 Answers