What is the actual definition of endogeneity?

Question

I've been learning about endogeneity but after looking around online I've gotten more and more confused about what the definition is.

Most pages say that in a model $y=X\beta+\epsilon$ the definition of endogeneity is $E[X'\epsilon] \neq 0$. But a lot of these same pages say that endogeneity is when $X$ is correlated with the error, or in other words, (if I am understanding this correctly) $Cov(X',\epsilon) \neq 0$. But these two things are not the same in general, right?

So in total I'd like to know what the definition of endogeneity is. Am I just confused? Is the definition of "correlated" different than what I think it is?

Since you don't actually tell us what you think "correlated" means, we may have difficulties answering your question. But here's a hint about the situation: what is the value of $E[\epsilon]$? What role does this quantity play in the formula for the covariance? When you account for that, what does the formula reduce to? — whuber, Feb 20 '17 at 23:53
I thought that to say $X$ and $\epsilon$ are correlated is to say that $Cov(X', \epsilon) \neq 0$. Is that correct? As for your point, I suppose the covariance that I wrote reduces to the definition I've seen if $E[\epsilon]=0$. Is it the case that if $\beta$ is properly fitted then that is true? While I know it is true for OLS, I don't see why that has to be true in general. — user35734, Feb 21 '17 at 00:22
The distribution of $\epsilon$ has nothing whatsoever to do with how the model is fitted. The *only* things you know about $\epsilon$ are what you assume about it. Your question is about the *model*, not about data or OLS. — whuber, Feb 21 '17 at 18:05
Ok. So it's accurate to say that in arbitrary linear models, the covariance definition and the expected-value-of-a-product definition of endogeneity are different, correct? So which one is actually the true definition? — user35734, Feb 21 '17 at 23:28
They are mathematically equivalent given that the expectation of $\epsilon$ is zero. The proof, which is elementary (and almost trivial), uses standard formulas for the covariance. — whuber, Feb 21 '17 at 23:46

score 1 · Accepted Answer · answered Apr 09 '19 at 13:52

You are correct in noting that, if $E \epsilon \neq 0$, $$ E[X \epsilon] \neq Cov(X, \epsilon) = E[X(\epsilon - E\epsilon)]. $$ However, assuming $E \epsilon = 0$ is usually without loss of generality. In particular, if $X$ contains a constant and if the coefficient on the constant carries no "structural" interpretation then we can always redefine this coefficient to make sure that $E \epsilon =0$.

To see this, write $X = (1, W')'$ and $\beta = (\beta_0, \beta_1')'$, so that the model reads $Y = \beta_0 + W'\beta_1 + \epsilon$. Solve for $\epsilon$ and take expectations to obtain:

$$ E[\epsilon] = E[Y - W'\beta_1] - \beta_0. $$

This shows that choosing $\beta_0 = E[Y - W'\beta_1]$ guarantees $E \epsilon = 0$.
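A small numerical check may make this concrete. The sketch below is only an illustration under assumed values: the data-generating process (a scalar regressor `w` with mean 1, an independent error with mean 2, and the coefficients 0.5 and 3.0) is hypothetical and not part of the question. It shows that $E[W\epsilon]$ and $Cov(W,\epsilon)$ differ when $E\epsilon \neq 0$, and that they coincide once the intercept is redefined as above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical DGP: the error has mean 2 and is independent of the regressor w.
w = rng.normal(loc=1.0, scale=1.0, size=n)
eps = rng.normal(loc=2.0, scale=1.0, size=n)
beta0, beta1 = 0.5, 3.0
y = beta0 + beta1 * w + eps

# With E[eps] != 0 the two "definitions" disagree:
moment = np.mean(w * eps)                  # sample analogue of E[W eps]
cov = moment - np.mean(w) * np.mean(eps)   # sample analogue of Cov(W, eps)

# Redefine the intercept as beta0_new = E[Y - W beta1]; the error is then centered.
beta0_new = np.mean(y - beta1 * w)
eps_new = y - beta0_new - beta1 * w

moment_new = np.mean(w * eps_new)
cov_new = moment_new - np.mean(w) * np.mean(eps_new)

print(f"before: E[W eps] = {moment:.3f}, Cov(W, eps) = {cov:.3f}")
print(f"after:  E[W eps] = {moment_new:.3f}, Cov(W, eps) = {cov_new:.3f}")
```

The slope `beta1` is untouched by the recentering; only the intercept absorbs the mean of the error, which is why the normalization is harmless when the intercept has no structural interpretation.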

markowitz · Answer 2 · 2020-04-30T10:52:01.147

In my opinion, it is not so strange that you are confused. I recently spent some effort on this topic myself.

The definition of exogeneity/endogeneity in econometrics is frequently ambiguous, and as a result so is the treatment of causality. Read here: Regression and causality in econometrics

Note that endogenous/exogenous is a concept that should have only a causal meaning. This point is a matter of debate, but that is my opinion. Read this related topic: Structural equation and causal model in economics

The other goal in econometrics is forecasting, but in that setting the endogeneity problem does not play an important role. Read here: Endogeneity in forecasting

Basically, the most important concept is that the exogeneity condition must relate to the structural error ($u$). Statistically speaking, the most frequent definition is conditional mean independence, $E[u|X]=0$, which is stronger than orthogonality, $E[uX]=0$; note that $E[u]=0$ holds by assumption, not by construction. Given $E[u]=0$, orthogonality and uncorrelatedness between the (structural) error and the covariates/regressors are the same thing.
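For completeness, here is a short sketch of why these implications hold, using only the law of iterated expectations and the standard covariance formula. Conditional mean independence implies both a zero mean and orthogonality:

$$ E[u] = E\big[E[u|X]\big] = 0, \qquad E[uX] = E\big[X\,E[u|X]\big] = 0. $$

And once $E[u]=0$, orthogonality and zero covariance coincide:

$$ Cov(X,u) = E[Xu] - E[X]\,E[u] = E[Xu]. $$

The converse direction fails in general: $E[uX]=0$ only rules out a linear relation between $u$ and $X$, while $E[u|X]=0$ rules out any dependence of the conditional mean of $u$ on $X$.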

Note that in regression, orthogonality/uncorrelatedness (where $u$ is now the regression error, i.e. the residual) holds by construction, not by assumption; in general $E[u|X]=0$ does not hold, but that is not very important.
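The difference between the two kinds of error can be illustrated with a simulation. This is only a sketch under assumed values: the structural model $y = 1 + 2x + u$ and the way $u$ is made to correlate with $x$ (through a shared component `v`) are hypothetical choices for illustration. The structural error is correlated with the regressor, yet the OLS residual is orthogonal to the regressors by construction, and the OLS slope does not recover the structural coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical structural model: y = 1 + 2*x + u, where u and x share the
# component v, so Cov(x, u) = Var(v) = 1 (x is endogenous).
v = rng.normal(size=n)
x = rng.normal(size=n) + v
u = v + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

# OLS of y on a constant and x.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

structural_cov = np.cov(x, u)[0, 1]   # sample Cov(x, structural error)
resid_moment = X.T @ resid / n        # sample E[X * residual], one entry per regressor

print("Cov(x, structural u):", round(structural_cov, 3))
print("E[X * residual]:     ", resid_moment)
print("OLS slope:           ", round(beta_hat[1], 3), "(structural slope is 2)")
```

In this setup the OLS slope settles near $2 + Cov(x,u)/Var(x) = 2.5$ rather than 2, while the residual moments are zero up to floating-point error. Orthogonality of the residual therefore says nothing about whether the structural error is exogenous.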

In short, the interpretation of $u$ is crucial. Most of the confusion comes from this point.

Basically, the so-called "error term" that you find in some econometrics presentations must be interpreted as the true/structural error. Other distinctions (orthogonality, correlation, conditional independence, full independence) can only produce confusion if the difference between the two types of error above is not clear.
