
It is known that the linear regression estimator can also be viewed as a method of moments estimator derived from the moment condition $E[X\epsilon]=0$. This moment condition follows from the exogeneity assumption $E[\epsilon|X]=0$. But the exogeneity assumption implies $E[h(X)\epsilon]=0$ for any function $h(X)$, by the law of iterated expectations. Why do we choose $h(X)=X$ specifically and not some other function of $X$?

Thanks in advance.

  • You are talking about linear regression, so it seems natural to consider the linear function? – hbadger19042 Jul 14 '20 at 08:27
  • @kevin012 Could you elaborate a little? I don't see how linearity in the coefficient $\beta$ relates to $h(X)$ being a linear function. – RevealedPreference Jul 14 '20 at 18:57
  • @RevealedPreference Could you add a proof that the exogeneity assumption $E[\epsilon | X] = 0$ and the law of iterated expectation imply that $E[h(X)\epsilon] = 0$? – dwolfeu Jul 14 '20 at 22:01
  • @RevealedPreference I've posted this as a [question](https://stats.stackexchange.com/questions/477532/exogeneity-assumption-applied-to-functions-of-the-design-matrix). – dwolfeu Jul 17 '20 at 05:22
  • @dwolfeu Sorry that I forgot to check on my question for the last few days. My proof for $E[h(X)\epsilon] = 0$ would be similar to what you wrote in your question: $E[h(X)\epsilon] = E[E[h(X)\epsilon \mid X]] = E[h(X)E[\epsilon \mid X]] = 0$. $h(X)$ can be treated as a constant with respect to the inner expectation because it is a function of $X$. – RevealedPreference Jul 18 '20 at 15:00
  • @RevealedPreference No worries! Your proof makes sense; I was going round the houses trying to use the expected value of a product of random variables. If you give your proof as an answer to my question, then I can upvote it! – dwolfeu Jul 18 '20 at 17:58
  • @dwolfeu Thanks. I will do that right away. Also thank you for your help on this question! I've upvoted your answer. :) – RevealedPreference Jul 19 '20 at 21:46

1 Answer


Short answer

The function $h(X)=X$ is used for the GMM because the resulting estimator coincides with the OLS estimator, which by the Gauss–Markov theorem is the best linear unbiased estimator.

The details

We start with some notation to avoid any confusion with rows and columns:

\begin{equation*} X = \begin{bmatrix} x_{11} & \ldots & x_{1p} \\ \vdots & \ddots & \vdots\\ x_{n1} & \ldots & x_{np} \end{bmatrix} ,\;\bar{y} = \begin{bmatrix} y_1 \\ \vdots\\ y_n \end{bmatrix} ,\;\bar{\beta} = \begin{bmatrix} \beta_1 \\ \vdots\\ \beta_p \end{bmatrix} ,\;\bar{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} \end{equation*}

We assume that $X$ has full column rank.

Taking $h(X) = X$, the GMM conditions are

\begin{equation} E\left[ \begin{bmatrix} x_{1j} & \cdots & x_{nj} \end{bmatrix} \begin{bmatrix} \epsilon_1 \\ \vdots\\ \epsilon_n \end{bmatrix} \right] = E\left[\sum_{i=1}^n x_{ij}\,\epsilon_i\right] = 0 \end{equation}

for $j \in \{1,\ldots,p\}$, i.e. each column of $X$ is orthogonal to the errors in expectation (equivalently, since the errors have mean 0, the covariance of each column of $X$ with the errors is 0). We can put these $p$ conditions into one neat equation as follows:

\begin{equation} E\left[ X^T\bar{\epsilon}\right] = \bar{0} \end{equation}

(Here $\bar{0}$ denotes the zero vector.)
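
As a quick sanity check, here is a minimal sketch with simulated data and NumPy (the data-generating process and variable names are my own, not part of the original argument): the sample analogue of these moment conditions, $\tfrac{1}{n}X^T\!\left(\bar{y} - X\bar{\beta}\right)$, is indeed close to $\bar{0}$ when evaluated at the true $\bar{\beta}$.

```python
import numpy as np

# Hypothetical simulated data: y = X @ beta_true + eps with E[eps | X] = 0.
rng = np.random.default_rng(0)
n, p = 1_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Sample analogue of E[X^T eps] = 0, evaluated at the true beta:
# each entry is an average of n mean-zero terms, so it is O(1/sqrt(n)).
print(X.T @ (y - X @ beta_true) / n)
```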

To find an estimate of $\bar{\beta}$ using the GMM, we replace $E\left[ X^T\bar{\epsilon}\right]$ by its sample analogue and choose $\bar{\beta}$ to bring it as close to $\bar{0}$ as possible, i.e. we find the value of $\bar{\beta}$ that minimises the norm of the following expression:

\begin{equation} X^T\!\left(\bar{y} - X\bar{\beta}\right) \end{equation}

Notice that $X\bar{\beta}$ is in the column space of $X$, since it is a linear combination of the columns of $X$. Also note that $X^T\!\left(\bar{y} - X\bar{\beta}\right) = \bar{0}$ if and only if $X\bar{\beta}$ is the projection of $\bar{y}$ onto the column space of $X$: if $X\bar{\beta}$ were any other point in the column space, then $\bar{y} - X\bar{\beta}$ would not be orthogonal to the column space, so the dot products making up $X^T\!\left(\bar{y} - X\bar{\beta}\right)$ could not all be 0. The following diagram (taken from Wikipedia) illustrates this point:

We want to minimise the norm of $X^T\!\left(\bar{y} - X\bar{\beta}\right)$ with respect to $\bar{\beta}$, which is clearly achieved when $X^T\!\left(\bar{y} - X\bar{\beta}\right) = \bar{0}$. Rearranging gives $X^TX\bar{\beta} = X^T\bar{y}$, and since $X$ has full column rank, $X^TX$ is invertible, so we can solve for $\bar{\beta}$:

\begin{equation} \bar{\beta} = \left(X^TX\right)^{-1}X^T\bar{y} \end{equation}

But this is just the usual OLS estimator, which by the Gauss–Markov theorem is the best linear unbiased estimator.
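
To make this concrete, here is a hedged numerical sketch (simulated data again; the alternative choice $h(X) = X^3$, applied elementwise, is just an arbitrary example I picked for illustration). Solving the sample moment conditions $X^T\!\left(\bar{y} - X\bar{\beta}\right) = \bar{0}$ reproduces the least-squares fit, while the alternative conditions $h(X)^T\!\left(\bar{y} - X\bar{\beta}\right) = \bar{0}$ give a different (still consistent, but generally noisier) estimate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 1_000, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

# Method of moments with h(X) = X: solve X^T (y - X beta) = 0,
# i.e. the OLS normal equations (X^T X) beta = X^T y.
beta_mm = np.linalg.solve(X.T @ X, X.T @ y)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_mm, beta_ols))  # True

# A different just-identified choice, h(X) = X**3 (elementwise):
# solve h(X)^T (y - X beta) = 0, i.e. (h(X)^T X) beta = h(X)^T y.
H = X**3
beta_alt = np.linalg.solve(H.T @ X, H.T @ y)
print(beta_alt)  # consistent, but not equal to beta_mm in finite samples
```

This is the point taken up in the comments below: any valid $h(X)$ yields a moment estimator, but $h(X)=X$ is the choice that reproduces OLS.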

  • Thanks dwolfeu! I guess the point is that it happens to be the case that the GMM estimator using $h(X)=X$ is exactly the same as the OLS estimator. But we can also use a general $h(X)$ to get a different estimator (if the sufficient conditions for convergence of the GMM estimator are satisfied by $h(X)$)? – RevealedPreference Jul 18 '20 at 15:03
  • @RevealedPreference Precisely. One can indeed use a different $h(X)$, but it won't give you a better estimator (and might result in a function that is difficult to minimise). – dwolfeu Jul 18 '20 at 18:01