
Background

I'm trying to understand the first example in a course on fitting models (so this may seem ludicrously simple). I've done the calculations by hand and they match the example, but when I repeat them in R, the model coefficients are off. I thought the difference might be due to the textbook using population variance ($\sigma^2$) whereas R may be using sample variance ($S^2$), but I can't see where either one enters the calculations. For example, if lm() uses var() somewhere, the help page for var() notes:

The denominator n - 1 is used which gives an unbiased estimator of the (co)variance for i.i.d. observations.

I have looked at the code for both lm() and lm.fit() and neither makes use of var(), but lm.fit() passes the data to compiled C code (z <- .Call(C_Cdqrls, x, y, tol, FALSE)) whose internals I can't easily inspect.
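(From what I can tell, that compiled routine does a QR least-squares fit, which never computes a variance at all. A minimal sketch in base R that reproduces the same coefficients on the data below; X and y are just illustrative names:)

X <- cbind(1, c(1, 2, 4))   # design matrix: intercept column + grade
y <- c(1, 5, 9)             # shoe
qr.coef(qr(X), y)           # -1.000000  2.571429, matching lm()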

Question

Can anyone explain why R is giving different results? Even if there is a difference in using sample vs population variance, why do the coefficient estimates differ?

Data

Fit a line to predict shoe size from grade in school.

# model data
mod.dat <- read.table(
    text = 'grade shoe
                1    1
                2    5
                4    9'
    , header = TRUE);

# mean
mod.mu  <- mean(mod.dat$shoe);
# total variability (sum of squared deviations from the mean)
mod.var <- sum((mod.dat$shoe - mod.mu)^2)

# model coefficients from textbook
mod.m  <- 8/3;
mod.b  <- -1;

# predicted values  ( 1.666667 4.333333 9.666667 )
mod.man.pred       <- mod.dat$grade * mod.m + mod.b;
# residuals         ( -0.6666667  0.6666667 -0.6666667 )
mod.man.resid      <- (mod.dat$shoe - mod.man.pred)
# residual sum of squares ( 1.333333 )
mod.man.unexpl.var <- sum(mod.man.resid^2);
# r^2               ( 0.9583333 )
mod.man.expl.var   <- 1 - mod.man.unexpl.var / mod.var;

# but lm() gives different results:
summary(lm(shoe ~ grade, data = mod.dat))
Call:
lm(formula = shoe ~ grade, data = mod.dat)

Residuals:
      1       2       3 
-0.5714  0.8571 -0.2857 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -1.0000     1.3093  -0.764    0.585
grade         2.5714     0.4949   5.196    0.121

Residual standard error: 1.069 on 1 degrees of freedom
Multiple R-squared:  0.9643,    Adjusted R-squared:  0.9286 
F-statistic:    27 on 1 and 1 DF,  p-value: 0.121
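(One way to see which line is actually the least-squares fit is to compare the residual sum of squares of each line directly; the lm() line wins. A small sketch, where sse() is a made-up helper name, not anything from the course or from lm():)

# residual sum of squares of the line y = m*x + b on the model data
sse <- function(m, b) sum((mod.dat$shoe - (m * mod.dat$grade + b))^2)
cf  <- coef(lm(shoe ~ grade, data = mod.dat))
sse(8/3, -1)       # 1.333333 : the textbook line
sse(cf[2], cf[1])  # 1.142857 : lm()'s line, strictly smaller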

Edit

As Ben Bolker has shown, it looks like teachers make mistakes sometimes. It seems that the R calculations are correct. Moral of the story: don't believe something just because a teacher says it is true. Verify it for yourself!

post-hoc
  • Can you edit the question to also include the calculations you made by hand to match the textbook result? – Juho Kokkala Jun 24 '14 at 22:54
  • @JuhoKokkala they are in the comments already. I tried to save space, but maybe it's not obvious. – post-hoc Jun 24 '14 at 22:55
  • Double check `mod.m=8/3`. Because if you set `mod.m=2.5714`, then they seem to be identical. – Stat Jun 24 '14 at 23:00
  • The coefficients mod.m = 8/3 and mod.b = -1 are not computed anywhere in the comments as far as I understand, so it's not obvious. As @Stat comments above, the error seems to be in computing mod.m. – Juho Kokkala Jun 24 '14 at 23:02
  • @Stat the values `8/3` and `-1` are listed in the text as the values which produce the smallest error by solving this equation: $21m^2+14mb+3b^2-94m-30b+81$. To be honest, I don't completely understand that part, but perhaps you are suggesting that the estimates have been rounded for pedagogic purposes? – post-hoc Jun 24 '14 at 23:16
  • @JuhoKokkala You are right. I just accepted that value from the textbook because I really don't understand that part. The coefficients are supposed to be the best solution to minimise the error of the line, but perhaps they are rounded to make the other calculations easier? He doesn't explain how he arrived at `8/3` and `-1` and, to be honest, I have no idea how one would arrive there either! Should I just trust R's results? I ask this question because I'm trying to re-analyse someone else's results and I'm having the same problem of R not giving the same estimates as the paper's author. – post-hoc Jun 24 '14 at 23:19
  • It's important to keep in mind that *anyone* can make mistakes - your teacher, you, answerers here, the R programmers - anyone. So when trying to figure out where mistakes may lie when things disagree, consider how many other people are checking each thing. In the case of the `lm` function in R, literally tens of thousands of people have checked the results by comparing them with other things, and the output of `lm` is checked against known examples each time anything changes in the code. With answers here, at least a few people are likely to check (your question has been looked at 29 times). – Glen_b Jun 25 '14 at 01:22
  • @Glen_b Your point is actually the reason why I came here to ask. I couldn't understand how R could be wrong on such a basic calculation, but I couldn't figure out why they were different. I even snooped around the source code. But in the end, the error was in the last place I thought to look, mostly because the calculus part is at the limits of my knowledge. I learnt a lot from the answer though! – post-hoc Jun 25 '14 at 01:24
  • Yes, it's important to try to figure out why they differ; it makes sense to ask here if you can't work it out. I was trying to suggest why the last place you considered might instead have been one of the first places to look. I've been caught by making last-minute 'simplifying' changes to examples on one or two occasions myself. – Glen_b Jun 25 '14 at 01:27

1 Answer


It looks like the author made a mathematical error somewhere.

If you expand the sum-of-squares deviation

$$ S = ((b+m)-1)^2 + ((b+2m)-5)^2 + ((b+4m)-9)^2 $$

you get

$$ \begin{split} S = & \; b^2 + 2bm + m^2 + 1 - 2b - 2m \\ {} + & \; b^2 + 4bm + 4m^2 + 25 - 10b - 20m \\ {} + & \; b^2 + 8bm + 16m^2 + 81 - 18b - 72m \end{split} $$

which reduces to $$ 3b^2 + 14bm + 21m^2 - 30b - 94m + 107 $$ which is the same as the author's expression, except for the constant term (which doesn't matter anyway).

Now we need to try to minimize this by setting the derivatives of $S$ with respect to $b$ and $m$ to zero and solving the system. $$ dS/db = 6 b + 14 m -30 \to 3 b +7 m-15 = 0 $$ $$ dS/dm = 14 b +42 m -94 \to 7 b + 21 m -47 = 0 $$
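(As a side check, base R's D() will take these derivatives symbolically; S below is just the expanded expression from above, and D() returns an unsimplified but equivalent form:)

S <- expression(3*b^2 + 14*b*m + 21*m^2 + 107 - 30*b - 94*m)
D(S, "b")   # an unsimplified form of 6*b + 14*m - 30
D(S, "m")   # an unsimplified form of 14*b + 42*m - 94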

Solve

$$ \begin{split} b & = (15-7m)/3 \\ 0 & = 7 (15-7m)/3 + 21 m-47 \\ 47 - 35 & = (-49/3 + 21) m \\ m & = (47-35)/(21-49/3) = 18/7 \end{split} $$

R says this is indeed 2.571429 ...
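(Or confirm both coefficients at once by handing the 2x2 linear system above to base R's solve():)

A <- matrix(c(3,  7,
              7, 21), nrow = 2, byrow = TRUE)   # coefficients of b and m
solve(A, c(15, 47))                             # -1.000000  2.571429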

Based on this link, this seems to be from a Coursera course ... ? Maybe there was a mis-transcription of the data somewhere?

The other, independent way to do this calculation is to know that the estimated regression slope is equal to the sum of cross products ($\sum (y-\bar y) (x-\bar x)$) divided by the sum of squares ($\sum (x-\bar x)^2$).

g <- c(1, 2, 4)     # grade
g0 <- g - mean(g)   # centered grades
s <- c(1, 5, 9)     # shoe
s0 <- s - mean(s)   # centered shoe sizes
sum(g0 * s0) / sum(g0^2)
## [1] 2.571429
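(This also shows why the sample-versus-population variance question from the original post cannot explain the discrepancy: the slope is equally cov(x, y)/var(x), and the n - 1 denominators cancel in the ratio:)

cov(g, s) / var(g)   # the n - 1 denominators cancel
## [1] 2.571429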

I think if the shoe sizes were $\{1, 11/3, 9\}$ instead of $\{1, 5, 9\}$, then the slope would come out to 8/3 ...
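(Checking that guess with the same cross-products formula, reusing g0 from above:)

s2 <- c(1, 11/3, 9)   # hypothetical shoe sizes
sum(g0 * (s2 - mean(s2))) / sum(g0^2)
## [1] 2.666667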

Ben Bolker
  • Wow. Yes, you are right. It's from a Coursera course, and it's from the video, not a transcription. So I'm guessing he simplified it to make the calculations simpler for the video and didn't expect anyone to try to repeat it. It just happened to be the first video that I saw, so I tried to follow along. It's clear that I need to upskill when it comes to maths. I think I found the error though: the constant term, which you say doesn't matter, is probably the value that threw off his calculations. I'll look through your answer a few more times to teach myself. I really appreciate it! – post-hoc Jun 25 '14 at 00:58
  • I don't think the constant term will throw off the calculations. It won't affect the estimates of the slope and intercept (it disappears when we take the derivative), only the estimates of the residual SSQ/standard deviation. – Ben Bolker Jun 26 '14 at 17:44