
After deriving the formula myself in an unnecessarily complicated way, I discovered this very old thread:

https://stats.stackexchange.com/a/89155/148856

There is however a step that I don't understand, namely the last one here:

\begin{align} \text{Var}(\hat{\beta_1}) & = \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} \right) \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1x_i + u_i )}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{substituting in the above} \\ &= \text{Var} \left(\frac{\sum_i (x_i - \bar{x})u_i}{\sum_i (x_i - \bar{x})^2} \right), \;\;\;\text{noting only $u_i$ is a random variable} \\ \end{align}
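(Side note, for completeness: the first equality is just the closed-form OLS slope. A quick toy check of my own, not from the thread, that the ratio matches what `lm` returns:)

```r
set.seed(123)
x <- runif(20, -10, 10)
y <- 2 * x - 7 + rnorm(20)

# closed-form OLS slope: S_xy / S_xx
b1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)

all.equal(b1_hat, unname(coef(lm(y ~ x))[2]))  # TRUE
```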

If I am reading this correctly, it seems to imply that:

\begin{align} \text{Var}\left(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i +u_i)\right) &= \text{Var}\left(\sum_i (x_i- \bar x)u_i\right) + 2\,\text{Cov}\left(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i),\, \sum_i (x_i- \bar x)u_i\right) \\ &\quad + \text{Var}\left(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i)\right) \\ &= \text{Var}\left(\sum_i (x_i- \bar x)u_i\right) \end{align}

i.e. that:

$2\,\text{Cov}\left(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i),\, \sum_i (x_i- \bar x)u_i\right) + \text{Var}\left(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i)\right) = 0$

So, if I define:

$S_A = \sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i +u_i) = \sum_i (x_i- \bar x)y_i$
$S_B = \sum_i (x_i- \bar x)u_i = \sum_i (x_i- \bar x)(y_i - y_{pred})$
$S_C = \sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i) = \sum_i (x_i- \bar x)(y_{pred})$

then I should find:

$S_A = S_B + S_C$
$2Cov(S_B,S_C) + Var(S_C) = 0$

If I run this R script:

sigma <- 0.1
N <- 10

# the design: x is generated once and held fixed across all simulated datasets
x <- runif(N, -10, 10)
x_mean <- mean(x)
y_true <- 2 * x - 7

set.seed(1324)
out <- t(replicate(1000, {

  # new noise (and hence a new y) in every replicate
  y <- y_true + rnorm(N, mean = 0, sd = sigma)

  lm1 <- lm(y ~ x)
  beta_0 <- coef(lm1)[1]
  beta_1 <- coef(lm1)[2]
  y_pred <- beta_0 + beta_1 * x  # fitted values from the ESTIMATED coefficients
  u <- y - y_pred                # residuals from the fit
  S_A <- sum((x - x_mean) * y)
  S_B <- sum((x - x_mean) * u)
  S_C <- sum((x - x_mean) * y_pred)
  c(beta_0, beta_1, S_A, S_B, S_C)

}))

colnames(out) <- c("beta_0", "beta_1", "S_A", "S_B", "S_C")

plot(out[,"S_A"],out[,"S_B"] + out[,"S_C"]); abline(0,1)

The 'sanity check' that $S_A = S_B + S_C$ works.
But then:

var(out[,"S_A"])
#[1] 2.721668
var(out[,"S_B"])
#[1] 8.436618e-27
var(out[,"S_C"])
#[1] 2.721668
cov(out[,"S_B"],out[,"S_C"])
#[1] -2.610527e-15

So it seems that the variance of the original term is in $S_C$ rather than in $S_B$.

This is puzzling, because the rest of the derivation (after this step) and the final result make sense to me, so I cannot reconcile the numerical results with the theory.

Can you see where I am going wrong?

NOTE: yes, I saw the thread "Understanding simplification of constants in derivation of variance of regression coefficient", but I am back to square one, because at some point it says:

\begin{equation} \begin{aligned} &= \mathbb{Var} \left(\frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i) }{\sum_i (x_i - \bar{x})^2} + \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2} \right) \\[6pt] &= \mathbb{Var} \left( \text{const} + \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2} \right) \\[6pt] &= \mathbb{Var} \left( \frac{\sum_i (x_i - \bar{x}) u_i}{\sum_i (x_i - \bar{x})^2} \right). \\[6pt] \end{aligned} \end{equation}

and I still do not see how:

$\frac{\sum_i (x_i - \bar{x})(\beta_0 + \beta_1 x_i) }{\sum_i (x_i - \bar{x})^2} = \text{const}$

If I could see that, of course the rest comes simply from $Var(constant + X) = Var(X)$.
But in my simulation this term is definitely not a constant :/
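(As a quick numerical reminder of the fact itself, $Var(\text{const} + X) = Var(X)$ — a toy check independent of the regression setup: adding a constant shifts the mean but leaves every deviation from the mean unchanged, so the sample variance is identical.)

```r
set.seed(1)
X <- rnorm(1000)  # a random variable
a <- 5            # a constant

# shifting by a constant leaves the sample variance unchanged
all.equal(var(X), var(a + X))  # TRUE
```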


EDIT: correction following John's answer.

\begin{align} \text{Var}\left(\sum_i (x_i- \bar x)y_i\right) &= \text{Var}\left(\sum_i (x_i- \bar x)(y_{true,i}+u_i)\right) \\ &= \text{Var}\left(\sum_i (x_i- \bar x)y_{true,i}+\sum_i (x_i- \bar x)u_i\right) \\ &= \text{Var}\left(\sum_i (x_i- \bar x)u_i\right) \end{align}

And the R script:

sigma <- 0.1
N <- 10

# as before, x is generated once and held fixed across all simulated datasets
x <- runif(N, -10, 10)
x_mean <- mean(x)
y_true <- 2 * x - 7

set.seed(1324)
out <- t(replicate(1000, {

  y <- y_true + rnorm(N, mean = 0, sd = sigma)

  lm1 <- lm(y ~ x)
  beta_0 <- coef(lm1)[1]
  beta_1 <- coef(lm1)[2]
  y_pred <- beta_0 + beta_1 * x
  # u <- y - y_pred                    # before: residuals from the ESTIMATED fit
  u <- y - y_true                      # now: errors from the TRUE line
  S_A <- sum((x - x_mean) * y)
  S_B <- sum((x - x_mean) * u)
  # S_C <- sum((x - x_mean) * y_pred)  # before: fitted values
  S_C <- sum((x - x_mean) * y_true)    # now: true values (a constant)
  c(beta_0, beta_1, S_A, S_B, S_C)

}))

colnames(out) <- c("beta_0", "beta_1", "S_A", "S_B", "S_C")

var(out[,"S_A"])
#[1] 3.491618
var(out[,"S_B"])
#[1] 3.491618
var(out[,"S_C"])
#[1] 0
cov(out[,"S_B"],out[,"S_C"])
#[1] 0

I hope this makes sense now.

user6376297
1 Answer


$\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i +u_i) =\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i )+\sum_i (x_i- \bar x)u_i$

Now, use the fact that if $a$ is a constant and $X$ is a random variable, then $Var(a+X)=Var(X)$ to deduce that $Var(\sum_i (x_i- \bar x)(\beta_0+\beta_1 x_i +u_i))= Var(\sum_i (x_i- \bar x)u_i)$

After you have fixed $x_1,...,x_n, \beta_0,\beta_1$ the variance of $S_C$ must be $0$ because every time you simulate a dataset, $S_C$ will be exactly the same as it is every other time.

In your R program, change this line

 y_pred <- beta_0 + beta_1 * x

to

 y_pred <- -7 + 2 * x

Then it will work. You have to use the true values of $\beta_0,\beta_1$, not the estimates.
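To see why (a minimal sketch, not the OP's full script): when $S_C$ is computed from the true coefficients, the simulated noise never enters the sum, so $S_C$ is identical in every replicate and its variance across replicates is zero.

```r
set.seed(1324)
x <- runif(10, -10, 10)

# S_C computed with the TRUE coefficients (-7 and 2): noise is drawn
# in every replicate but never enters the sum
sc <- replicate(1000, {
  noise <- rnorm(10, mean = 0, sd = 0.1)  # unused by S_C
  sum((x - mean(x)) * (-7 + 2 * x))
})

var(sc)  # (numerically) zero: S_C is a constant once x and the true betas are fixed
```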

John L
  • Thank you! I was indeed wondering why the $\beta$'s did not have the 'hat'... So in essence we are saying that $y_i = y_{true,i} + u_i$, where $y_{true,i} = \beta_0 +\beta_1 x_i$. I will add an edit section to my post to acknowledge this answer. – user6376297 Apr 03 '21 at 16:17