How to interpret the output of the summary method for an lm object in R?

Question

I am using sample algae data to understand data mining a bit more. I have used the following commands:

data(algae)
algae <- algae[-manyNAs(algae), ]
clean.algae <- knnImputation(algae, k = 10)
lm.a1 <- lm(a1 ~ ., data = clean.algae[, 1:12])
summary(lm.a1)

Subsequently I received the results below. However, I cannot find any good documentation that explains what most of this means, especially Std. Error, t value and Pr.

Can someone please shed some light on this? Most importantly, which variables should I look at to ascertain on whether a model is giving me good prediction data?

Call:
lm(formula = a1 ~ ., data = clean.algae[, 1:12])

Residuals:
    Min      1Q  Median      3Q     Max 
-37.679 -11.893  -2.567   7.410  62.190 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept)  42.942055  24.010879   1.788  0.07537 . 
seasonspring  3.726978   4.137741   0.901  0.36892   
seasonsummer  0.747597   4.020711   0.186  0.85270   
seasonwinter  3.692955   3.865391   0.955  0.34065   
sizemedium    3.263728   3.802051   0.858  0.39179   
sizesmall     9.682140   4.179971   2.316  0.02166 * 
speedlow      3.922084   4.706315   0.833  0.40573   
speedmedium   0.246764   3.241874   0.076  0.93941   
mxPH         -3.589118   2.703528  -1.328  0.18598   
mnO2          1.052636   0.705018   1.493  0.13715   
Cl           -0.040172   0.033661  -1.193  0.23426   
NO3          -1.511235   0.551339  -2.741  0.00674 **
NH4           0.001634   0.001003   1.628  0.10516   
oPO4         -0.005435   0.039884  -0.136  0.89177   
PO4          -0.052241   0.030755  -1.699  0.09109 . 
Chla         -0.088022   0.079998  -1.100  0.27265   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 17.65 on 182 degrees of freedom
Multiple R-squared:  0.3731,    Adjusted R-squared:  0.3215 
F-statistic: 7.223 on 15 and 182 DF,  p-value: 2.444e-12

An annotated regression output can be found at: http://www.ats.ucla.edu/stat/stata/output/reg_output.htm The layout of the output might look a little different (it's using STATA rather than R) but the content is more or less the same. Hope this helps. — Graeme Walsh, May 17 '13 at 00:21
You'll also want to read this: [interpretation-of-rs-lm-output](http://stats.stackexchange.com/questions/5135/). After having read those, see if you still have any questions left, & if you do, edit your Q to clarify what you still need to know. — gung - Reinstate Monica, May 17 '13 at 01:36

score 32 · Accepted Answer · edited Oct 02 '17 at 16:16

It sounds like you need a decent basic statistics text that covers at least basic location tests, simple regression and multiple regression.

Std. Error, t value and Pr.

Std. Error is the standard deviation of the sampling distribution of the estimate of the coefficient under the standard regression assumptions. Such standard deviations are called standard errors of the corresponding quantity (the coefficient estimate in this case).

In the case of simple regression, it's usually denoted $s_{\hat \beta}$, as here. Also see this

For multiple regression, it's a little more complicated, but if you don't know what these things are it's probably best to understand them in the context of simple regression first.

t value is the value of the t-statistic for testing whether the corresponding regression coefficient is different from 0.

The formula for computing it is given at the first link above.

Pr. is the p-value for the hypothesis test for which the t value is the test statistic. It tells you the probability of a test statistic at least as unusual as the one you obtained, if the null hypothesis were true. In this case, the null hypothesis is that the true coefficient is zero; if that probability is low, it's suggesting that it would be rare to get a result as unusual as this if the coefficient were really zero.
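As a rough illustration of how these three columns relate, the sketch below recomputes the t value and the p-value for the NO3 row from the numbers printed in the question's output. The variable names are made up for this example; it assumes the usual two-sided t test on the residual degrees of freedom.

```r
# Values copied from the NO3 row of the summary output in the question
estimate  <- -1.511235
std_error <-  0.551339
df_resid  <-  182   # residual degrees of freedom reported in the output

# t value: the estimate measured in units of its standard error
t_value <- estimate / std_error

# Two-sided p-value: chance of a |t| at least this large if the true coefficient were 0
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(t_value, 3)    # compare with the printed t value for NO3
signif(p_value, 3)   # compare with the printed Pr(>|t|) for NO3
```

The same two lines of arithmetic apply to every row of the coefficient table.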

Most importantly, which variables should I look at to ascertain on whether a model is giving me good prediction data?

What do you mean by 'good prediction data'? Can you make it clearer what you're asking?

The Residual standard error, which is usually called $s$, represents the standard deviation of the residuals. It's a measure of how close the fit is to the points.
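A minimal sketch of that quantity, using the built-in `mtcars` data as a stand-in because the algae data need an extra package. The model formula is an arbitrary choice for illustration. Note that R divides the residual sum of squares by the residual degrees of freedom rather than by n - 1, so the result is close to, but not exactly, `sd()` of the residuals.

```r
# Stand-in model on a built-in data set (not the algae model from the question)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Residual standard error by hand: sqrt(RSS / residual degrees of freedom)
rss   <- sum(residuals(fit)^2)
s_man <- sqrt(rss / df.residual(fit))

s_man
summary(fit)$sigma   # the value summary() prints as "Residual standard error"
```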

The Multiple R-squared, also called the coefficient of determination, is the proportion of the variance in the response that's explained by the model. Adding variables - even ones that don't help - never makes it smaller and almost always makes it larger. The Adjusted one reduces that to account for the number of variables in the model.
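To make the adjustment concrete, here is a small sketch that rebuilds the adjusted value from the figures in the question's output, assuming the standard formula 1 - (1 - R²)(n - 1)/(residual DF). The variable names are illustrative only.

```r
r2       <- 0.3731   # Multiple R-squared from the output (already rounded)
df_model <- 15       # numerator DF on the F-statistic line
df_resid <- 182      # residual degrees of freedom

# Number of observations: predictors' DF + residual DF + 1 for the intercept
n <- df_model + df_resid + 1

# Adjusted R-squared penalises R-squared for the DF used by the model
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
adj_r2   # agrees with the printed 0.3215 up to the rounding of r2
```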

The $F$ statistic on the last line is telling you whether the regression as a whole is performing 'better than random' - any set of random predictors will have some relationship with the response, so it's seeing whether your model fits better than you'd expect if all your predictors had no relationship with the response (beyond what would be explained by that randomness). This is used for a test of whether the model outperforms 'noise' as a predictor. The p-value in the last row is the p-value for that test, essentially comparing the full model you fitted with an intercept-only model.
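The comparison with an intercept-only model can be sketched directly. The example below uses the built-in `mtcars` data as a stand-in (an assumption for illustration, not the algae model) and checks that the F statistic from `summary()` is the same one `anova()` gives for the two nested models. The last line recomputes the p-value for the F statistic printed in the question.

```r
# Stand-in data (mtcars), not the algae model from the question
full <- lm(mpg ~ wt + hp, data = mtcars)
null <- lm(mpg ~ 1,       data = mtcars)   # intercept-only model

# F test comparing the two nested models
cmp       <- anova(null, full)
f_anova   <- cmp$F[2]
f_summary <- unname(summary(full)$fstatistic["value"])

f_anova
f_summary   # same statistic as reported on the last line of summary(full)

# Upper-tail p-value for the F statistic printed in the question's output
p_from_output <- pf(7.223, df1 = 15, df2 = 182, lower.tail = FALSE)
p_from_output
```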

Where do the data come from? Is this in some package?

score 7 · Answer 2 · edited Apr 13 '17 at 12:44

The Standard error is an estimate of the standard deviation of the estimated strength of the effect, that is, of the estimated relationship between each predictor variable and the predicted variable. If it's high, then the effect size will have to be stronger for us to be able to be sure that it's a real effect, and not just an artefact of randomness.

The t-statistic is a measure of how extreme the value you see is, relative to the standard error (assuming a t distribution, centred on the null hypothesis).

The p-value is an estimate of the probability of seeing a t-value as extreme, or more extreme than the one you got, if you assume that the null hypothesis is true (the null hypothesis is usually "no effect", unless something else is specified). So if the p-value is very low, then there is a higher probability that you're seeing data that is counter-indicative of zero effect. In other situations, you can get a p-value based on other statistics and variables.

Unfortunately, if that explanation of the p-value is confusing, that's because the entire concept is confusing. It's important to note that technically a low p-value does not show high probability of an effect, although it may indicate that. Have a read of some of the high-voted p-value questions, to get an idea about what's going on here.

answered May 17 '13 at 00:36

naught101

please correct me if i am wrong but the higher the standard error the stronger the prediction model? – godzilla May 17 '13 at 00:43
This is not correct. High standard errors tell you that you can't estimate the coefficient very precisely - the 'true' coefficient may well be far away from your estimated value (the standard error is like a 'typical distance' away). – Glen_b May 17 '13 at 00:45
@godzilla: if the std.err goes up, then the distribution of likely values is widening, which means that your effect size will become swamped, so making predictions will be harder. – naught101 May 17 '13 at 00:50
thanks this has helped a lot, can you please give me more clarity as to the t value please? – godzilla May 17 '13 at 01:11
@godzilla: um... not unless you specify what it is that you want to know. – naught101 May 17 '13 at 01:15
what is the relation between t value and std error? How does this affect the p value? – godzilla May 17 '13 at 01:17
@godzilla: I think you really need to read an introductory stats text, and/or the wikipedia pages I linked to. I answered those exact questions in my answer. If you want detail, then ask for specifics. – naught101 May 17 '13 at 01:22
to be honest those examples go into a lot of mathematics which is hard to follow, i just want a simple example i can follow so i can get a clear idea in my head – godzilla May 17 '13 at 01:24
@godzilla: Perhaps try some of the answers at http://stats.stackexchange.com/questions/31/what-is-the-meaning-of-p-values-and-t-values-in-statistical-tests – naught101 May 17 '13 at 01:26
@godzilla For t-values, the simplest explanation is that you can use 2 (as a rule of thumb) as the threshold to decide whether or not a variable is statistically significant. A t-value above two in absolute value means the variable is statistically significant, and one below two means it is not. For an easy treatment of this material see Chapter 5 of Gujarati's Basic Econometrics. The ucla link I provided in another comment explains how to interpret the p value. I assume it's the interpretation of the output for practical use that you want rather than the actual underlying theory, hence my oversimplification. – Graeme Walsh May 17 '13 at 14:02
@naught101: What does "effect size" mean? – stackoverflowuser2010 Feb 09 '14 at 19:48
The p-value is the probability that the attribute is not relevant, right? Why don't you just write that instead of "So if the p-value is very low, then there is a higher probability that you're seeing data that is counter-indicative of zero effect." – stackoverflowuser2010 Feb 09 '14 at 20:13
