
Imagine predicting the salary of some professors from their years of experience (time) while controlling for/holding constant their number of publications (pubs).

Question: Is the following account of what it means to hold their number of pubs constant correct, and is it demonstrable via simulation in R?

Imagine we had countless professors. Take a sample of those with the exact same number of pubs (e.g., $1$), and then:

  • Fit a regression model with only time as the predictor and get the regression coefficient of time.
  • Take another sample with pubs of $2$, fit the regression model again, and get the regression coefficient of time.
  • Keep changing pubs to $3, 4, \ldots$ and each time get the regression coefficient of time.

In the end, the average of our regression coefficients of time will be a partial regression coefficient that has controlled for the professors' pubs while predicting salary from time (a rough R sketch of this procedure follows below).
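
In rough R code, the procedure I have in mind looks something like the sketch below (the coefficients, sample size, and range of pubs values are all made up, just to illustrate the idea):

```r
# Sketch of the "hold pubs constant" procedure (all numbers are made up)
set.seed(1)

n      <- 1e5
pubs   <- sample(1:10, n, replace = TRUE)            # number of publications
time   <- rnorm(n, mean = 2 + 0.8 * pubs, sd = 2)    # experience, correlated with pubs
salary <- 5000 + 400 * time + 300 * pubs + rnorm(n, sd = 1000)

# For each fixed number of pubs, regress salary on time alone and keep the slope
slopes <- sapply(1:10, function(p)
  coef(lm(salary[pubs == p] ~ time[pubs == p]))[2])

mean(slopes)                               # average of the per-pubs slopes
coef(lm(salary ~ time + pubs))["time"]     # partial coefficient, for comparison
```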

p.s. Is controlling for a predictor similar to integrating it out?

rnorouzian
  • Isn't the coefficient of time in the straightforward model `salary ~ time + pubs` already a partial regression coefficient that has controlled for `pubs` (that is 435.3 in your code above)? Are you concerned with controlling for nonlinear effects in `pubs`? – Dex Groves Aug 16 '20 at 07:06
  • @DexGroves, OP's question is about the deeper meaning of controlling for/holding constant a predictor (as s/he describes it) and whether this concept is demonstrable via an R simulation. – Reza Aug 16 '20 at 07:24
  • You would not want a simple average as that would put too much weight on cases with uncommon numbers of publications. – Henry Aug 16 '20 at 10:30
  • @Henry, would you please clarify via a simulation what exactly you mean? – Reza Aug 16 '20 at 14:53
  • Although this is one way to control for numbers of publications, note that it would be most closely related to a multiple regression in which `pubs` is introduced as a *categorical* variable rather than as a numerical count. Concerning your PS, see https://stats.stackexchange.com/questions/17336/how-exactly-does-one-control-for-other-variables – whuber Aug 16 '20 at 19:17
  • @whuber, is a simulation possible? – rnorouzian Aug 16 '20 at 19:27
  • What do you want to simulate and why? – whuber Aug 16 '20 at 19:54
  • @whuber, simulating the exact same scenario I describe in my question, to show that it holds, for instructional purposes and being able to produce helpful graphs etc. – rnorouzian Aug 16 '20 at 19:58

1 Answer


Yes, if the model is correctly specified.

Suppose your data are generated by $$ y = \beta_1 x_1 + \beta_2 x_2 + \epsilon, \mbox{ where } E[\epsilon|x_1, x_2] = 0, $$ i.e. $$ E[y|x_1, x_2] = \beta_1 x_1 + \beta_2 x_2. $$ Suppose $x_1$ is the predictor of interest and $x_2$ is the control. Conditioning on the control $x_2$ gives $$ E[y|x_2] = \beta_1 E[x_1|x_2] + \beta_2 x_2. \quad (*) $$

The empirical counterpart of $(*)$ is the regression you're suggesting---regress $y$ on $x_1$ (with intercept) for a given value of $x_2$. Note that for any given value of $x_2$, the slope from this regression conditional on $x_2$ is already an unbiased estimator of $\beta_1$.

Averaging over $x_2$ makes the estimate less noisy. The assumption $E[\epsilon|x_1, x_2] = 0$ implies that the errors, and hence the conditional estimates, are uncorrelated across values of $x_2$. Therefore averaging over $x_2$ gives a smaller standard error.
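
A quick simulation along these lines (the coefficients, group sizes, and grid of $x_2$ values below are arbitrary choices for illustration, not part of the argument itself):

```r
# Each slope conditional on x2 is unbiased for beta1 but noisy;
# their average is much less so (example values are arbitrary)
set.seed(123)
beta1 <- 2; beta2 <- -1

x2 <- rep(1:20, each = 200)                        # the control, 20 distinct values
x1 <- rnorm(length(x2), mean = 0.5 * x2, sd = 1)   # predictor of interest, correlated with x2
y  <- beta1 * x1 + beta2 * x2 + rnorm(length(x2))

# Regress y on x1 (with intercept) separately within each value of x2
slopes <- sapply(unique(x2), function(v)
  coef(lm(y[x2 == v] ~ x1[x2 == v]))[2])

sd(slopes)                      # spread of the individual conditional estimates
mean(slopes)                    # their average, close to beta1 = 2
coef(lm(y ~ x1 + x2))["x1"]     # full-regression partial coefficient, for comparison
```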

Comment

The statement "the regression conditional on $x_2$ gives an unbiased estimator of $\beta_1$" is contingent upon correct specification---correct functional form, no omitted variables, etc. With a real data set, you would have to be willing to believe/claim that the true functional form is linear, that no controls are omitted, and so on.

If the true population regression function is not linear but $E[\epsilon|x_1, x_2] = 0$ still holds, I would expect averaging the OLS coefficient for $x_1$ from the regression conditional on $x_2$, call it $\hat{\beta}_1|x_2$, over $x_2$ to be close to the OLS coefficient $\hat{\beta}_1$ from the full linear regression of $y$ on $x_1$ and $x_2$.
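
One way to check this numerically is sketched below; the quadratic term in $x_2$ and all the constants are my own example choices, and other nonlinear specifications need not behave the same way:

```r
# Nonlinear in x2, but E[eps | x1, x2] = 0 still holds
set.seed(42)
beta1 <- 2

x2 <- rep(1:20, each = 200)
x1 <- 1 + 0.5 * x2 + rnorm(length(x2))
y  <- beta1 * x1 + 0.1 * x2^2 + rnorm(length(x2))

# Average of the conditional-on-x2 slopes
mean(sapply(unique(x2), function(v)
  coef(lm(y[x2 == v] ~ x1[x2 == v]))[2]))

# Coefficient on x1 from the (misspecified) linear regression y ~ x1 + x2
coef(lm(y ~ x1 + x2))["x1"]
```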

Michael