I have encountered the statement that "the error term and one of the regressors are correlated" a few times, and I am having trouble understanding what exactly is meant. Say we have the DGP $$y=\beta_1+\beta_2 x+\beta_3 z+e$$ where the error term $e$ satisfies all of the standard OLS assumptions, and $x$ and $z$ are correlated. Now, if the coefficients of the relation $$y=\lambda_1+\lambda_2 x+u$$ were estimated, the statement could be made that the estimate of $\lambda_2$ is biased because $x$ is correlated with the error term $u$. But this statement seems to assume that we are trying to estimate the coefficient $\beta_2$ from the first equation.

So my question is the following: when we speak of an error term being correlated with a regressor, does this mean that there is a specific coefficient we are trying to estimate (in the above example, the coefficient $\beta_2$), and that plugging it into the given equation yields a relationship in which the regressor and error term are correlated?
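For concreteness, here is a minimal simulation sketch of the situation described above (the parameter values and NumPy setup are my own, purely for illustration, and not from the original post): regressing $y$ on $x$ alone does not recover $\beta_2$, in line with the answer below.

```python
# Minimal simulation sketch (illustrative values, not from the post):
# beta2 = 1, beta3 = 2, and z correlated with x via z = 0.5*x + noise.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
beta2, beta3 = 1.0, 2.0

x = rng.normal(size=n)
z = 0.5 * x + rng.normal(size=n)      # correlated regressors
e = rng.normal(size=n)                # error obeying the standard assumptions
y = beta2 * x + beta3 * z + e         # DGP (intercepts set to 0 for brevity)

# Long regression y ~ x + z recovers beta2 ...
X = np.column_stack([x, z])
b_long, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... while the short regression y ~ x estimates lambda2, not beta2.
lam2_hat = (x @ y) / (x @ x)

print("estimate of beta2 (long):   ", round(b_long[0], 3))   # ~ 1.0
print("estimate of lambda2 (short):", round(lam2_hat, 3))    # ~ 1.0 + 2.0*0.5 = 2.0
```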

kjetil b halvorsen
Jemlin95
  • Does this post not answer your question? https://stats.stackexchange.com/questions/263324/how-can-the-regression-error-term-ever-be-correlated-with-the-explanatory-variab – Nick Koprowicz Jan 04 '20 at 03:54
  • I do not think the linked duplicate is an answer. In my opinion this question is more about identification than about confusing residuals with errors. – Jesper for President Jan 07 '20 at 14:50

1 Answer


Assume that the distributions here are such that

$$E[y\mid (x,z)] = \beta_2 x+\beta_3 z$$

Then we can write

$$y = E[y\mid (x,z)] + e_{xz}$$

where $e_{xz}$ is the error of this specific conditional expectation function. By construction, as is easily verified,

$$E[e_{xz} \mid (x,z)] = 0 \implies E[e_{xz} x] = E[e_{xz} z]=0$$
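The verification is one pass of the law of iterated expectations: for instance,

$$E[e_{xz}\, x] = E\big[\,x\, E[e_{xz} \mid (x,z)]\,\big] = E[x\cdot 0] = 0,$$

and the same argument applies to $z$.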

Assume further that it holds that

$$E[y\mid x] = \lambda_2 x$$

Here too we can write

$$y = E[y\mid x] + e_{x},\qquad E[e_{x} \mid x] = 0$$

In general, we expect that $\beta_2 \neq \lambda_2$.
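One way to see why (adding, for illustration only, the assumption that $E[z\mid x]$ is linear, say $E[z\mid x]=\delta x$) is to iterate expectations:

$$\lambda_2 x = E[y\mid x] = E\big[E[y\mid (x,z)]\mid x\big] = \beta_2 x + \beta_3 E[z\mid x] = (\beta_2+\beta_3\delta)\,x,$$

so $\lambda_2 = \beta_2+\beta_3\delta$: the two coefficients differ whenever $z$ matters ($\beta_3\neq 0$) and is correlated with $x$ ($\delta\neq 0$).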

So which is the "true" coefficient? Which coefficient expresses the "true" relationship?

Both and neither.
Both, because both $\beta_2$ and $\lambda_2$ reflect a valid statistical relationship between $y$ and $x$. In the first case, with $z$ present. In the second case, with $z$ absent.

Neither, because the word "true" can only be associated with a causal relationship, and nowhere up to now have we argued about causal relationships, only about statistical ones.

But econometrics is the arrogant child of statistics: it believes it can argue about causal relationships all the time, even as the parent frowns, or worse, every time the word "causality" is uttered.

This happens because econometrics is the applied arm of a social/behavioral science, economics, and economics starts by theorizing about causal relationships. In our example, assume that this theorizing leads to the specification using the $(x,z)$ variables. Then we are "forced" by our own theoretical/behavioral/causal arguments to consider the specification with $(x,z)$ and the betas as the correct one.

So from the two valid statistical relationships (or from the "infinite" number of such valid statistical relationships), we want to estimate the first specification, because we consider it the "causally true" one: all other effects on $y$, we argue, are unimportant, average out to zero, and do not affect the causality carriers $(x,z)$.

But estimators are like computers: they do what we tell them to do, not what we want them to do. If we give OLS the $(y,x)$ sample, we "tell" it, whether we like it or not, to estimate the lambda coefficient. To see this, apply OLS to obtain

$$\hat \beta_2 = \frac {\sum x_iy_i}{\sum x_i^2}$$

Now we can substitute either of the two decompositions of $y$ into this expression. So

$$\hat \beta_2 = \frac {\sum x_i(\beta_2x_i + \beta_3z_i+e_{xz})}{\sum x_i^2} = \beta_2 + \beta_3\frac {\sum x_iz_i}{\sum x_i^2} + \frac {\sum x_ie_{xz}}{\sum x_i^2} \to_p \beta_2 + \beta_3\frac {E(xz)}{E(x^2)}$$

But also,

$$\hat \beta_2 = \frac {\sum x_i(\lambda_2x_i + e_{x})}{\sum x_i^2} = \lambda_2 + \frac {\sum x_ie_{x}}{\sum x_i^2} \to_p \lambda_2$$

So no matter how we choose to "baptize" the estimator ("$\hat \beta_2$" in our case, indicating what we want it to estimate), it will consistently estimate $\lambda_2$: it does what we actually told it to do, not what we wanted it to do.
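A quick numerical check of this plim result (again an illustrative sketch with arbitrary values, not part of the original derivation): the short-regression OLS slope agrees with $\beta_2 + \beta_3\,E(xz)/E(x^2)$ up to sampling noise.

```python
# Sketch: the OLS slope of y on x converges to beta2 + beta3*E[xz]/E[x^2].
import numpy as np

rng = np.random.default_rng(42)
n = 1_000_000
beta2, beta3 = 1.5, -0.8

x = rng.normal(size=n)
z = 0.6 * x + rng.normal(size=n)   # E[xz] = 0.6 * E[x^2] by construction
e = rng.normal(size=n)             # conditional-expectation error e_xz
y = beta2 * x + beta3 * z + e

slope = (x @ y) / (x @ x)                                   # OLS, no intercept
target = beta2 + beta3 * np.mean(x * z) / np.mean(x ** 2)   # plim formula

print(f"OLS slope:                  {slope:.4f}")   # ~ 1.5 - 0.8*0.6 = 1.02
print(f"beta2 + beta3*E(xz)/E(x^2): {target:.4f}")  # same, up to noise
```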

The customary way to say this is to declare that "the regressor is correlated with the error term", but here we are no longer referring just to the conditional expectation function error, but to the "error"

$$u = \beta_3 z+ e_{xz}.$$

This is our way of saying that we want to estimate the beta, but we can only tell the estimator to estimate the lambda.
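Making this correlation explicit, using the orthogonality of $e_{xz}$ established earlier,

$$E[xu] = \beta_3 E[xz] + E[x\,e_{xz}] = \beta_3 E[xz] \neq 0$$

whenever $\beta_3\neq 0$ and $E[xz]\neq 0$, which is precisely the setting of the question.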

Alecos Papadopoulos
  • "+1" Is there a particular reason why this is not also an answer to linked question on [regression](https://stats.stackexchange.com/questions/263324/how-can-the-regression-error-term-ever-be-correlated-with-the-explanatory-variab)? – Jesper for President Feb 05 '20 at 20:15
  • @StopClosingQuestionsFast Well, _there_ the upfront issue is the confusion between errors and residuals. Of course, my answer here is pertinent also there, after one clears the error/residual confusion. – Alecos Papadopoulos Feb 06 '20 at 19:04
  • I disagree. The way I read your answer here, it only makes sense to talk about the OLS estimator $\hat \beta_2$ as being consistent/inconsistent once we have identified what we want to estimate. If I simply write $y = \theta x + v$ then $\hat \beta_2$ is consistent if $\theta =\lambda_2$ with $\mathbb E[y\lvert x] = \lambda_2 x$ but inconsistent if $\theta = \beta_2$. The point is simply that the OLS estimator always estimates the coefficient of the linear projection consistently (given existence of moments). It is not only residuals that are always orthogonal – Jesper for President Feb 06 '20 at 19:25
  • but also the error in the linear projection, which is still a population level property. – Jesper for President Feb 06 '20 at 19:26
  • @StopClosingQuestionsFast Indeed that is the case, I thought this was pretty standard knowledge, but I don't understand what you disagree with. Also: OLS always consistently estimates the partial derivatives of a non-linear conditional expectation function, evaluated (the partial derivatives) at the expected values of the regressors. – Alecos Papadopoulos Feb 06 '20 at 19:48
  • Anyway, if you have the time, I posted a question for you [here](https://stats.stackexchange.com/questions/448273/ols-as-approximation-for-non-linear-function) :) – Jesper for President Feb 06 '20 at 20:33