
Imagine predicting the salary of some professors from their years of experience (time) while controlling for/holding constant their number of publications (pubs).

Question: Is the following account of what it means to hold their number of pubs constant correct, and is it demonstrable via simulation in R?

Imagine we had countless professors. Take a sample of those with the exact same number of pubs (e.g., $1$), and then:

  • Fit a regression model with only time as the predictor and get the regression coefficient of time.
  • Take another sample with pubs of $2$, fit the regression model again, and get the regression coefficient of time.
  • Keep changing pubs to $3, 4, \ldots$ and each time get the regression coefficient of time.

In the end, the average of our regression coefficients of time will be a partial regression coefficient that has controlled for the professors' pubs while predicting salary from time (a rough R sketch of this procedure follows below).
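
In rough R code, the procedure I have in mind looks something like the sketch below (the coefficients, sample size, and range of pubs values are all made up, just to illustrate the idea):

```r
# Sketch of the "hold pubs constant" procedure (all numbers are made up)
set.seed(1)

n      <- 1e5
pubs   <- sample(1:10, n, replace = TRUE)            # number of publications
time   <- rnorm(n, mean = 2 + 0.8 * pubs, sd = 2)    # experience, correlated with pubs
salary <- 5000 + 400 * time + 300 * pubs + rnorm(n, sd = 1000)

# For each fixed number of pubs, regress salary on time alone and keep the slope
slopes <- sapply(1:10, function(p)
  coef(lm(salary[pubs == p] ~ time[pubs == p]))[2])

mean(slopes)                               # average of the per-pubs slopes
coef(lm(salary ~ time + pubs))["time"]     # partial coefficient, for comparison
```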

p.s. Is controlling for a predictor similar to integrating it out?

rnorouzian
  • Isn't the coefficient of time in the straightforward model `salary ~ time + pubs` already a partial regression coefficient that has controlled for `pubs` (that is 435.3 in your code above)? Are you concerned with controlling for nonlinear effects in `pubs`? – Dex Groves Aug 16 '20 at 07:06
  • @DexGroves, OP's question is about the deeper meaning of controlling for/holding constant a predictor (as s/he describes it) and whether this concept is demonstrable via an R simulation. – Reza Aug 16 '20 at 07:24
  • You would not want a simple average as that would put too much weight on cases with uncommon numbers of publications. – Henry Aug 16 '20 at 10:30
  • @Henry, would you please clarify via a simulation what exactly you mean? – Reza Aug 16 '20 at 14:53
  • Although this is one way to control for numbers of publications, note that it would be most closely related to a multiple regression in which `pubs` is introduced as a *categorical* variable rather than as a numerical count. Concerning your PS, see https://stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables – whuber Aug 16 '20 at 19:17
  • @whuber, is a simulation possible? – rnorouzian Aug 16 '20 at 19:27
  • What do you want to simulate and why? – whuber Aug 16 '20 at 19:54
  • @whuber, simulating the exact same scenario I describe in my question, to show that it holds, for instructional purposes and being able to produce helpful graphs etc. – rnorouzian Aug 16 '20 at 19:58

1 Answer


Yes, if the model is correctly specified.

Suppose your data are generated by $$ y = \beta_1 x_1 + \beta_2 x_2 + \epsilon, \mbox{ where } E[\epsilon|x_1, x_2] = 0, $$ i.e. $$ E[y|x_1, x_2] = \beta_1 x_1 + \beta_2 x_2. $$ Suppose $x_1$ is the predictor of interest and $x_2$ is the control. Conditioning on the control $x_2$ gives $$ E[y|x_2] = \beta_1 E[x_1|x_2] + \beta_2 x_2. \quad (*) $$

The empirical counterpart of $(*)$ is the regression you're suggesting---regress $y$ on $x_1$ (with intercept) for a given value of $x_2$. Note that for any given value of $x_2$, the slope from this regression conditional on $x_2$ is already an unbiased estimator of $\beta_1$.

Averaging over $x_2$ makes the estimate less noisy. The assumption $E[\epsilon|x_1, x_2] = 0$ implies that the errors, and hence the conditional estimates, are uncorrelated across values of $x_2$. Therefore averaging over $x_2$ gives a smaller standard error.
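
A quick simulation along these lines (the coefficients, group sizes, and grid of $x_2$ values below are arbitrary choices for illustration, not part of the argument itself):

```r
# Each slope conditional on x2 is unbiased for beta1 but noisy;
# their average is much less so (example values are arbitrary)
set.seed(123)
beta1 <- 2; beta2 <- -1

x2 <- rep(1:20, each = 200)                        # the control, 20 distinct values
x1 <- rnorm(length(x2), mean = 0.5 * x2, sd = 1)   # predictor of interest, correlated with x2
y  <- beta1 * x1 + beta2 * x2 + rnorm(length(x2))

# Regress y on x1 (with intercept) separately within each value of x2
slopes <- sapply(unique(x2), function(v)
  coef(lm(y[x2 == v] ~ x1[x2 == v]))[2])

sd(slopes)                      # spread of the individual conditional estimates
mean(slopes)                    # their average, close to beta1 = 2
coef(lm(y ~ x1 + x2))["x1"]     # full-regression partial coefficient, for comparison
```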

Comment

The statement "the regression conditional on $x_2$ gives an unbiased estimator of $\beta_1$" is contingent upon correct specification---correct functional form, no omitted variables, etc. With a real data set, you would have to be willing to believe/claim that the true functional form is linear, that no controls are omitted, and so on.

If the true population regression function is not linear but $E[\epsilon|x_1, x_2] = 0$ still holds, I would expect averaging the OLS coefficient for $x_1$ from the regression conditional on $x_2$, call it $\hat{\beta}_1|x_2$, over $x_2$ to be close to the OLS coefficient $\hat{\beta}_1$ from the full linear regression of $y$ on $x_1$ and $x_2$.
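
One way to check this numerically is sketched below; the quadratic term in $x_2$ and all the constants are my own example choices, and other nonlinear specifications need not behave the same way:

```r
# Nonlinear in x2, but E[eps | x1, x2] = 0 still holds
set.seed(42)
beta1 <- 2

x2 <- rep(1:20, each = 200)
x1 <- 1 + 0.5 * x2 + rnorm(length(x2))
y  <- beta1 * x1 + 0.1 * x2^2 + rnorm(length(x2))

# Average of the conditional-on-x2 slopes
mean(sapply(unique(x2), function(v)
  coef(lm(y[x2 == v] ~ x1[x2 == v]))[2]))

# Coefficient on x1 from the (misspecified) linear regression y ~ x1 + x2
coef(lm(y ~ x1 + x2))["x1"]
```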

Michael