3

Hi, I get this output from R's summary of an lm:

lm(formula = weight.nz ~ dChgs.nz)

Residuals:
     Min       1Q   Median       3Q      Max 
-15373.7   -664.4    243.3   1104.2   9137.2 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.853e+00  2.141e+01   0.087    0.931    
  dChgs.nz  7.036e+07  5.841e+06  12.046   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 1814 on 7464 degrees of freedom
Multiple R-squared: 0.01907,    Adjusted R-squared: 0.01894 
F-statistic: 145.1 on 1 and 7464 DF,  p-value: < 2.2e-16

I think the small p-value suggests significance, BUT the R² is tiny, so I would conclude it is not significant.

I have read about the problems with R² in other questions on this subject, which suggest I use the standard error as a guide, but I am not sure how to interpret it. How do I use the 1814 to tell me whether my result is significant or not?

ManInMoon
  • 183
  • 7
  • 5
    Statistically significant doesn't imply large $R^2$; with large $n$ even very tiny effects are distinguishable from chance; statistical significance is not practical significance. – Glen_b May 08 '13 at 11:49
  • @Glen_b So what should I be looking at in this output to tell me whether it is significant? – ManInMoon May 08 '13 at 11:56
  • 2
    That depends what 'it' is. You have already discovered that dChgs.nz is significantly different from zero in this model. This is probably more of a sign that you have three parameters estimated from >7000 cases. If you want to 'use standard error as a guide' you could get a confidence interval for the effect size by handing `confint` your fitted model. – conjugateprior May 08 '13 at 12:00
  • 2
    See http://stats.stackexchange.com/questions/58366/why-does-random-looking-data-give-a-really-good-model for a recent run through these issues. It is _statistically significant_ in the customary sense, but just a useless model practically. You can compare the residual SE with the SD of the response variable (outcome, dependent variable), but they will be almost equal, which is the whole point. – Nick Cox May 08 '13 at 12:03
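The `confint` suggestion in the comments can be sketched like this. The original `weight.nz` / `dChgs.nz` data are not available, so simulated stand-in data (with a slope and noise level loosely matching the output above) are used here:

```r
# Hypothetical stand-in data; the real weight.nz / dChgs.nz are not available
set.seed(1)
dChgs.nz  <- rnorm(7466, sd = 1e-4)
weight.nz <- 7e7 * dChgs.nz + rnorm(7466, sd = 1800)

fit <- lm(weight.nz ~ dChgs.nz)

# A confidence interval gives a range of plausible effect sizes,
# which is more informative than the p-value alone
confint(fit, "dChgs.nz", level = 0.95)
```

The width of that interval, judged against what would be a practically meaningful effect, is what "using the standard error as a guide" amounts to.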

2 Answers

8

Statistically significant doesn't imply large $R^2$; with large $n$ even very tiny effects are distinguishable from chance; statistical significance is not practical significance.

As for significance: when you have only a single predictor, you can look at either the p-value next to the variable (dChgs.nz) or the p-value for the $F$ for the overall regression - they're the same (and indeed the square of the $t$ value for the coefficient of the variable is the $F$ for the regression).
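That identity is easy to check on simulated data (hypothetical values; the original data aren't available):

```r
# Simulated single-predictor regression
set.seed(42)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)
fit <- lm(y ~ x)

s     <- summary(fit)
t_val <- s$coefficients["x", "t value"]
F_val <- s$fstatistic["value"]

# With one predictor, the squared t for the slope equals the overall F
all.equal(unname(t_val^2), unname(F_val))  # TRUE
```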

The p-value in this case is very small, indicating it's clearly not just random variation making the coefficient different from zero.

But that doesn't mean it's telling you very much, other than "you have a really large sample size".

With multiple regression, the p-value for each variable is beside the variable name, and the one for the overall regression is still associated with the $F$ at the bottom.

Glen_b
  • 257,508
  • 32
  • 553
  • 939
6

The Multiple $R^2$ is the square of the correlation between the response and the fitted values. It tells you how much of the total variance of the response is explained by the model's predictions. The $R^2$ doesn't tell you whether the model is significant or not. Of course, if you want a good prediction model, your aim is a high $R^2$; in this case, your model explains only about 1.9% of the total variance of the response.

The $F$-statistic tests the hypothesis that all regression coefficients except the intercept are 0 versus the alternative that at least one coefficient differs from zero. Because you have only one coefficient besides the intercept, the $F$-value is simply the squared $t$-value of that coefficient: $F=t^2=12.046^2=145.1$. The $p$-value for the regression coefficient of dChgs.nz comes from a Wald test of the hypothesis that this coefficient equals 0 versus the alternative that it is nonzero.
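The "$R^2$ = squared correlation between response and fitted values" statement can be verified directly (on made-up data, since the original isn't available):

```r
# R-squared equals the squared correlation of y with the fitted values
set.seed(1)
x   <- rnorm(100)
y   <- x + rnorm(100)
fit <- lm(y ~ x)

r2 <- summary(fit)$r.squared
all.equal(r2, cor(y, fitted(fit))^2)  # TRUE
```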

Recall that a simple regression model has the form $$ y_{i}=\beta_{0} + \beta_{1}\cdot x_{i} + \varepsilon_{i}, $$ where $\varepsilon_{i}$ is the error term, assumed to be normally distributed with mean $0$ and variance $\sigma^2$. The residual standard error in the output is an estimate of $\sigma$. The residual degrees of freedom are $n-2$ in this case, because two coefficients have been estimated (the intercept and one slope).
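The residual standard error and its degrees of freedom can be reproduced by hand from the residuals (again on simulated data):

```r
# Residual standard error: sqrt of the residual sum of squares over n - 2
set.seed(2)
x   <- rnorm(50)
y   <- 2 * x + rnorm(50)
fit <- lm(y ~ x)

n         <- length(y)
sigma_hat <- sqrt(sum(residuals(fit)^2) / (n - 2))

all.equal(sigma_hat, summary(fit)$sigma)  # TRUE
fit$df.residual == n - 2                  # TRUE
```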

Because you have a large number of observations ($n=7466$), even small effect sizes are significant. This doesn't mean, however, that these effect sizes are meaningful.
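A quick simulation illustrates this: with a large sample, even a trivially small true slope yields a tiny p-value alongside a near-zero $R^2$ (illustrative numbers, not the asker's data):

```r
# Large n: a tiny true effect is "significant" but explains almost nothing
set.seed(3)
n <- 100000
x <- rnorm(n)
y <- 0.02 * x + rnorm(n)  # true slope is only 0.02

s <- summary(lm(y ~ x))
s$coefficients["x", "Pr(>|t|)"]  # very small p-value
s$r.squared                      # R-squared near zero
```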

COOLSerdash
  • 25,317
  • 8
  • 73
  • 123