
In this paper: http://www.nber.org/papers/w19774.pdf (ungated version here), the author spells out the following regression (equation 1 in the paper):

$s_{ij} = \alpha + \beta \bar{s}_j + \epsilon_{ij},$

where $i$ indexes individuals, $j$ indexes schools, $s_{ij}$ is the outcome for individual $i$ in school $j$, and $\bar{s}_j$ is the average outcome for individuals at school $j$. Note that individual $i$ is included in the average for his/her school, $\bar{s}_j$.

How does the author arrive at the left-hand-side expression for $\hat{\beta}$ (reproduced below)? Also, what does $n_j$ mean? And why does the expression equal $1$? I feel like seeing more intermediate steps would help me grasp this.

$\frac{\sum_j \sum_i s_{ij} (\bar{s}_j - \bar{s})}{\sum_j n_j (\bar{s}_j - \bar{s})^2} = \frac{\sum_j (\bar{s}_j - \bar{s})(n_j \bar{s}_j)}{\sum_j n_j (\bar{s}_j - \bar{s})^2} = 1$

Note that page 195 of Mostly Harmless Econometrics has something very similar.

bill999
  • When I tried to open the paper, I just got "Online access to NBER Working Papers denied, you have no subscription". – random_guy Nov 29 '14 at 20:56
  • Hmm - sorry about that. I do think, however, that the paper is not necessary to answer the question. I'll add a little more in terms of notation. – bill999 Nov 29 '14 at 20:59
  • This appears to be ordinary least squares regression. You can find the formula for the slope estimate $\hat\beta$ written in many forms in many threads. For instance, try [this search](http://stats.stackexchange.com/search?q=+regression+slope+formula). It is also one formulation of (one-way) Analysis of Variance (ANOVA). – whuber Nov 29 '14 at 21:04
  • Thank you. I think that how to apply this formula is not as obvious to me as it should be. If we have a usual regression: $y_i = \alpha + \beta x_i + \epsilon_i$, then the formula is: $\hat{\beta} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i(x_i - \bar{x})^2}$. The difficulty for me is how to translate this into the above case where our outcome variable has both an $i$ and a $j$ subscript. Also, in the above equation, mathematically why does it equal $1$? This is not obvious to me from a math standpoint - but the intuition is helped a lot by @Andy's answer. – bill999 Nov 30 '14 at 20:55

1 Answer


As whuber pointed out, Angrist is just applying the formula for the OLS estimator to the regression equation stated above. From an interpretation standpoint, regressing $$s_{ij} = \alpha + \beta \overline{s}_j + \epsilon_{ij}$$ is the same as first regressing $$s_{ij} = a + \sum_j \delta_j D_j + e_{ij},$$ where the $D_j$ are dummies for each school, and then using the fitted values from that regression in $$s_{ij} = \alpha + \beta \widehat{s}_{ij} + \epsilon_{ij}$$

To convince yourself, you can try the following code. It is written for Stata, but the logic carries over to your preferred statistical package and example data set:

// load an example data set
webuse nlswork

// generate the group-average of earnings by race
egen av_wage = mean(ln_wage), by(race)

// regress individual wages on the group wages
reg ln_wage av_wage

// now do the same thing with the dummy variable method described here
reg ln_wage i.race
predict wage_hat, xb

reg ln_wage wage_hat

The regression coefficients will not be exactly 1 due to floating-point rounding error, but the logic goes through nonetheless.
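To tie this back to the exact formula in the question, you can also compute the slope by hand. The following is a minimal sketch continuing the nlswork example above; the names gm, num_term, and den_term are mine:

// grand mean of the outcome (s-bar in the question's notation)
quietly summarize ln_wage
local gm = r(mean)

// numerator terms: s_ij * (sbar_j - sbar)
gen num_term = ln_wage * (av_wage - `gm')

// denominator terms: summing (sbar_j - sbar)^2 over individuals
// weights each school-group by its size n_j
gen den_term = (av_wage - `gm')^2

quietly summarize num_term
local num = r(sum)
quietly summarize den_term

// displays 1 up to floating-point error
display `num' / r(sum)

Summing the squared deviations over individuals is what produces the $n_j$ weights in Angrist's denominator, since $\bar{s}_j$ is constant within a group.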

Another way to see the tautological nature of this regression is given in the published version of Angrist's paper. Assume you have a mean-zero random variable $y$. If you take the mean of $y$ conditional on another random variable $z$, i.e. $E(y|z) = \mu_{y|z}$, then the regression of $y$ on this group-mean variable must give a coefficient of one. Using a formulation of the OLS estimator (maybe one that you find more familiar), the corresponding regression coefficient of $y$ on $\mu_{y|z}$ is $$\beta = \frac{E(y\mu_{y|z})}{\text{Var}(\mu_{y|z})}$$ (No means are subtracted here because $y$, and hence $\mu_{y|z}$, is mean zero.)

Then take the numerator and apply the law of iterated expectations: $$ \begin{align} E(y\mu_{y|z}) &= E\left(E(y|z,\mu_{y|z})\cdot \mu_{y|z}\right) \\ &= E\left(E(y|z)\cdot \mu_{y|z}\right) \\ &= E(\mu_{y|z}^2) \\ &= \text{Var}(\mu_{y|z}), \end{align} $$ where the last step uses $E(\mu_{y|z}) = E(y) = 0$. Substituting back into the formula for the OLS estimator you get $$\beta = \frac{E(y\mu_{y|z})}{\text{Var}(\mu_{y|z})} = \frac{\text{Var}(\mu_{y|z})}{\text{Var}(\mu_{y|z})} = 1$$
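For the sample expression in the question, the same argument goes through with sums in place of expectations. Here $n_j$ is the number of students in school $j$ (so $\sum_i s_{ij} = n_j \bar{s}_j$) and $\bar{s}$ is the grand mean over all students (so $\sum_j n_j (\bar{s}_j - \bar{s}) = 0$). A sketch filling in the intermediate steps:

$$ \begin{align} \hat{\beta} &= \frac{\sum_j \sum_i (\bar{s}_j - \bar{s})(s_{ij} - \bar{s})}{\sum_j \sum_i (\bar{s}_j - \bar{s})^2} \\ &= \frac{\sum_j \sum_i s_{ij} (\bar{s}_j - \bar{s})}{\sum_j n_j (\bar{s}_j - \bar{s})^2} \\ &= \frac{\sum_j (\bar{s}_j - \bar{s})(n_j \bar{s}_j)}{\sum_j n_j (\bar{s}_j - \bar{s})^2} \\ &= \frac{\sum_j n_j (\bar{s}_j - \bar{s})^2}{\sum_j n_j (\bar{s}_j - \bar{s})^2} = 1 \end{align} $$

The first line is the usual slope formula with regressor $x_{ij} = \bar{s}_j$. In the second line, the $\bar{s}$ term drops out of the numerator because $\sum_j \sum_i (\bar{s}_j - \bar{s}) = 0$, and the denominator collapses to the $n_j$-weighted sum because $\bar{s}_j$ is constant within a school. The third line uses $\sum_i s_{ij} = n_j \bar{s}_j$, and the last subtracts $\bar{s} \sum_j n_j (\bar{s}_j - \bar{s}) = 0$ from the numerator, recovering the middle expression in the question.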

Andy
  • Thank you. This goes a long way towards laying out the intuition behind what is going on. – bill999 Nov 30 '14 at 21:04
  • @Andy Please let me know if you know an answer to this question https://stats.stackexchange.com/questions/403715/how-to-derive-a-ranking-function-by-analysing-feature-correlations Looking forward to hearing from you :) – EmJ Apr 18 '19 at 01:57