
I have a cross-section of $x$, $y_1$, and $y_2$. These are individual level data used in labor economics. I have random variation in $x$ and I'm interested in the effect of $x$ on $y_1$. It is well established in earlier research that $x$ causally affects $y_2$ positively. Economic intuition says that $y_2$ will affect $y_1$ positively as well. Hence, $x$ has an effect on $y_1$ through $y_2$.

I believe that estimating the regression $y_1 = a_1 + a_2 x + a_3 y_2$ is problematic, since $y_2$ is endogenous. What type of model(s) can I estimate to identify the "direct" effect of $x$ on $y_1$ (the $a_2$ parameter) while controlling for the effect $x$ has on $y_1$ through $y_2$?

sgtbp
  • could you explain what real-world variables are behind $y_1$, $y_2$ and $x$? This would help sorting out your problem. – E. Sommer Oct 27 '18 at 09:50
  • Your OLS estimate for $a_2$ would be biased not because $y_2$ is endogenous, but because $x$ itself is endogenous due to multicollinearity between $x$ and $y_2$. One strategy would be to find a valid IV that affects $x$, but not $y_2$. This IV would affect $y_1$ only through $x$. The first stage of your IV estimate would sort of partial out the effect of $x$ on $y_1$ that is uncorrelated with $y_2$. – BellmanEqn Mar 14 '19 at 04:11
  • @sgtbp: First of all, you have to realize that concepts like "endogeneity" and "direct effect" are causal. My reply here (https://stats.stackexchange.com/questions/493211/under-which-assumptions-a-regression-can-be-interpreted-causally/493905#493905) may be of interest to you. That said, you do give some causal assumptions above. – markowitz Nov 21 '20 at 14:48
  • Moreover, you write "Hence, $x$ has an effect on $y_1$ through $y_2$. … can I estimate to identify the "direct" effect of $x$ on $y_1$ …?". From your words it is not clear whether you assume a direct effect (of $x$ on $y_1$) and want to estimate it, or whether you do not assume one. This point is crucial for writing down the associated structural equations correctly. I'm too busy right now, but if you clear up this point I will reply later. – markowitz Nov 21 '20 at 14:48
  • Thanks for your reply, @markowitz. I'm struggling to see what you need clarified. However, I'm interested in the causal or direct effect of $x$ on $y_1$. To be clear, in the model I pose in my question the effect of $x$ on $y_1$ is given by $\frac{\partial y_1}{\partial x} = a_2 + a_3 \frac{\partial y_2}{\partial x}$. I want an unbiased and consistent estimate of $a_2$. My challenge is that it is not reasonable to assume the latter derivative is zero. – sgtbp Nov 22 '20 at 15:29
  • @sgtbp, I see that you have not accepted my answer, asked for clarifications, or given your opinion. What do you think? – markowitz Dec 05 '20 at 20:31

2 Answers


If $y_2$ can be observed, there is nothing wrong with your approach. You are interested in the effect of $x$ on $y_1$ and you control for the partial influence of $y_2$. Estimating

$$ y_1 = \alpha + \beta_1 x + \beta_2 y_2 + \varepsilon $$

will yield an unbiased estimate of the effect of $x$ on $y_1$ if everything else (captured by $\varepsilon$) does not affect $y_1$ for given values of $x$ and $y_2$. I might have misunderstood what you mean by the 'direct effect', but $\widehat{\beta_1}$ will give you the effect of $x$ on $y_1$ net of the impact from $y_2$.
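
To illustrate, here is a minimal simulation sketch. The data-generating process and all coefficient values are my own illustrative assumptions, not from the question: when the error terms are independent of the regressors, OLS of $y_1$ on $x$ and $y_2$ jointly recovers the coefficient on $x$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Illustrative DGP (all coefficients assumed for this sketch):
# x randomly assigned; y2 = 1.5*x + e2; y1 = 2.0*x + 0.8*y2 + e1,
# with e1 and e2 independent standard normals -- the key assumption above.
x = rng.normal(size=n)
y2 = 1.5 * x + rng.normal(size=n)
y1 = 2.0 * x + 0.8 * y2 + rng.normal(size=n)

# OLS of y1 on a constant, x, and y2
X = np.column_stack([np.ones(n), x, y2])
alpha_hat, b1_hat, b2_hat = np.linalg.lstsq(X, y1, rcond=None)[0]

print(b1_hat, b2_hat)  # close to the true values 2.0 and 0.8
```

The estimate of the coefficient on $x$ stays close to its true value precisely because, in this simulated setup, nothing in the error term affects $y_1$ given $x$ and $y_2$.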

E. Sommer

You are interested in causal inference with linear models, and therefore with linear regression. In situations like this we have to deal with both regression and causality, which is a widespread and slippery problem. I tried to summarize this problem here: Under which assumptions a regression can be interpreted causally?

Following the approach that I suggest there, we have to write down structural causal equations that encode the causal assumptions. My questions in the comments were asking for clarification about these. From what you said, it seems that we have two structural equations:

$y_1 = \beta_1 y_2 + \epsilon_1$

$y_2 = \beta_2 x + \epsilon_2$

So there is no direct effect of $x$ on $y_1$; however, there is an indirect one. Indeed, we can see that $y_1 = \beta_1 \beta_2 x + \beta_1 \epsilon_2 + \epsilon_1 = \beta_3 x + \epsilon_3$

where $\beta_3 = \beta_1 \beta_2$ represents the indirect (and total) effect of $x$ on $y_1$.

Now I add some further needed (causal) assumptions. In the initial two structural equations the structural errors are exogenous ($E[\epsilon_1 | y_2]=0$ and $E[\epsilon_2 | x]=0$) and independent of each other.

As a consequence, in the last structural equation the structural error $\epsilon_3$ is exogenous ($E[\epsilon_3 | x]=0$).

Then you can run the regression $y_1 = \theta_1 x + u_1$, and $\theta_1$ identifies $\beta_3$, which is what you are looking for.

Moreover, in this example, in the regression $y_1 = \theta_2 y_2 + u_2$ the coefficient $\theta_2$ identifies $\beta_1$.
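
A short simulation sketch of this first SEM (the coefficient values are my own illustrative choices, not from the question) confirms both identification results:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# First SEM: y2 = b2*x + e2, y1 = b1*y2 + e1, with e1 and e2
# independent standard normals (illustrative coefficient values).
b1, b2 = 0.8, 1.5
x = rng.normal(size=n)
y2 = b2 * x + rng.normal(size=n)
y1 = b1 * y2 + rng.normal(size=n)

def ols(cols, y):
    # OLS without a constant (all variables have mean zero here)
    return np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]

theta1 = ols([x], y1)[0]   # identifies the total effect b1*b2 = 1.2
theta2 = ols([y2], y1)[0]  # identifies b1 = 0.8
```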

Modifying the model as suggested in the comments, we have two structural equations:

$y_1 = \beta_1 y_2 + \beta_2 x + \epsilon_1$

$y_2 = \beta_3 x + \epsilon_2$

Here $\beta_2$ is the direct effect of $x$ on $y_1$, which is what we are interested in. Moreover, there is an indirect effect too. Now we can see that

$y_1 = \beta_1 \beta_3 x + \beta_2 x + \beta_1 \epsilon_2 + \epsilon_1 = \beta_4 x + \epsilon_3$

where $\beta_4 = \beta_1 \beta_3 + \beta_2$ represents the total effect of $x$ on $y_1$

and $\epsilon_3 = \beta_1 \epsilon_2 + \epsilon_1 $

Now, as before, I add some further needed (causal) assumptions. In the initial two structural equations the structural errors are exogenous ($E[\epsilon_1 | y_2, x]=0$ and $E[\epsilon_2 | x]=0$) and independent of each other.

As a consequence, in the last structural equation the structural error $\epsilon_3$ is exogenous too ($E[\epsilon_3 | x]=0$).

Then you can run three useful regressions:

$y_1 = \theta_1 x + u_1$

$y_2 = \theta_2 x + u_2$

$y_1 = \theta_3 y_2 + \theta_4 x + u_3$

Here $\theta_1$ identifies $\beta_4$, $\theta_2$ identifies $\beta_3$, and $\theta_3$ and $\theta_4$ identify $\beta_1$ and $\beta_2$ (the latter being what you are looking for).

Moreover, from $\theta_1 - \theta_3 \theta_2$ (mirroring $\beta_4 - \beta_1 \beta_3$) we identify $\beta_2$ again. So if the restriction
$\theta_4 = \theta_1 - \theta_3 \theta_2$ does not hold, we have evidence against the SEM (i.e., against the causal assumptions).
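
These identification claims, and the over-identifying restriction, can be checked with a quick simulation (the coefficient values are again illustrative assumptions of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Second SEM with a direct effect (illustrative coefficients):
# y2 = b3*x + e2, y1 = b1*y2 + b2*x + e1, e1 and e2 independent.
b1, b2, b3 = 0.8, 2.0, 1.5
x = rng.normal(size=n)
y2 = b3 * x + rng.normal(size=n)
y1 = b1 * y2 + b2 * x + rng.normal(size=n)

def ols(cols, y):
    # OLS without a constant (all variables have mean zero here)
    return np.linalg.lstsq(np.column_stack(cols), y, rcond=None)[0]

theta1 = ols([x], y1)[0]           # total effect b4 = b1*b3 + b2 = 3.2
theta2 = ols([x], y2)[0]           # b3 = 1.5
theta3, theta4 = ols([y2, x], y1)  # b1 = 0.8 and b2 = 2.0

# Over-identifying restriction: theta4 should be close to theta1 - theta3*theta2
```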

Moreover, note that, obviously, not all regressions are good. For example, if we run the regression

$y_1 = \theta_5 y_2 + u_4$

the coefficient $\theta_5$ does not identify any parameter of the SEM; $\theta_5$ is biased for $\beta_1$ ($x$ acts as an omitted/confounding variable).
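
A simulation sketch of the same (illustrative) SEM shows this omitted-variable bias numerically: with standard-normal errors and $x$, the probability limit of $\theta_5$ is $\beta_1 + \beta_2\beta_3/(\beta_3^2+1)$, not $\beta_1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Same illustrative SEM as above, now omitting x from the regression.
b1, b2, b3 = 0.8, 2.0, 1.5
x = rng.normal(size=n)
y2 = b3 * x + rng.normal(size=n)
y1 = b1 * y2 + b2 * x + rng.normal(size=n)

# y1 regressed on y2 alone: x is an omitted confounder of the y2 -> y1 link.
theta5 = np.linalg.lstsq(y2[:, None], y1, rcond=None)[0][0]

# plim(theta5) = b1 + b2*b3/(b3**2 + 1), roughly 1.72 here, far from b1 = 0.8.
```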

(I have dropped the constant terms for simplicity.)

Finally, the initial regression that you had in mind was fine: there is no endogeneity here (though it could arise by chance). However, I suppose that you had another approach to the problem in mind; I suggest this one.

markowitz
  • There is a direct effect of $x$ on $y_1$ in addition to the relation specified by your second equation. $x$ is missing from your first equation. – sgtbp Nov 23 '20 at 08:58
  • Ok, later I will add the case that accounts for that. However, this was precisely what I feared. From your explanation it was clear that you are interested in the causal effect of $x$ on $y_1$ and also that there is an indirect one (through $y_2$), but it was not clear whether you assumed a direct effect too. Now it is clear. Note that the causal assumptions have to precede the estimation procedure. From your question I fear that you conflated them (a common mistake). – markowitz Nov 23 '20 at 10:38