
What is the point of using the identity matrix as weighting matrix in GMM?

The GMM estimator minimizes the distance $g_n(\delta)'\hat{W}g_n(\delta)$, where $g_n(\delta) = \frac{1}{n}\sum_ix_i\epsilon_i(\delta)$. If we set $\hat{W}=I$, the distance becomes $g_n(\delta)'g_n(\delta)$, i.e. the sum of squared coordinates of $g_n$.

The result of the minimization is still a GMM estimator, but it is clearly not efficient (for efficiency we should have set $\hat{W}=S^{-1}$, where $S = \frac{1}{n}\sum_i\epsilon_i^2x_ix_i'$).
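For concreteness, here is a minimal sketch of the one-step objective with $\hat W=I$ in a toy linear model (the data-generating process and all variable names are illustrative assumptions only, and endogeneity is omitted for brevity):

```python
# Minimal sketch: one-step GMM for y_i = z_i' delta + eps_i with instruments x_i
# and identity weighting. Toy data; names (n, X, Z, delta0) are illustrative.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))                    # instruments x_i
Z = X[:, :2] + 0.1 * rng.normal(size=(n, 2))   # regressors z_i
delta0 = np.array([1.0, -0.5])                 # true coefficients
y = Z @ delta0 + rng.normal(size=n)

def gbar(delta):
    """Sample moments g_n(delta) = (1/n) * sum_i x_i * (y_i - z_i' delta)."""
    return X.T @ (y - Z @ delta) / n

def objective(delta, W):
    g = gbar(delta)
    return g @ W @ g                           # g_n(delta)' W g_n(delta)

res = minimize(objective, x0=np.zeros(2), args=(np.eye(X.shape[1]),))
print("one-step GMM estimate with W = I:", res.x)
```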

So why would we proceed in this direction? Is it common in practice as a first step towards the efficient GMM estimator, or are there other reasons?

PhDing

2 Answers


Yes, obtaining a first-step estimator is the canonical use. Of course, the error terms in $$S = \frac{1}{n}\sum_i\epsilon_i^2x_ix_i'$$ are not observable, so you need to replace them with something feasible. Since the efficient GMM estimator depends on $\hat S$, you first need some feasible preliminary estimator, such as the one using $I$ as the weighting matrix.
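A hedged sketch of this two-step recipe for a linear model, using the closed-form solution $\hat\delta(W)=(S_{xz}'WS_{xz})^{-1}S_{xz}'Ws_{xy}$ of the quadratic GMM objective (the simulated data and variable names are my own illustrative assumptions, not part of the answer):

```python
# Two-step (feasible efficient) linear GMM: step 1 uses W = I, step 2 uses
# W = S_hat^{-1} built from the step-1 residuals. Toy heteroskedastic data.
import numpy as np

rng = np.random.default_rng(1)
n = 2000
X = rng.normal(size=(n, 3))                      # instruments x_i
Z = X[:, :2] + 0.1 * rng.normal(size=(n, 2))     # regressors z_i
delta0 = np.array([1.0, -0.5])
eps = rng.normal(size=n) * (1 + np.abs(X[:, 0])) # heteroskedastic errors
y = Z @ delta0 + eps

Sxz = X.T @ Z / n                                # sample cross moments
sxy = X.T @ y / n

def gmm(W):
    """Closed-form linear GMM estimator for a given weighting matrix W."""
    return np.linalg.solve(Sxz.T @ W @ Sxz, Sxz.T @ W @ sxy)

# Step 1: preliminary (consistent but inefficient) estimator with W = I.
delta_1 = gmm(np.eye(X.shape[1]))

# Step 2: S_hat = (1/n) sum_i e_i^2 x_i x_i' from step-1 residuals,
# then re-estimate with the feasible efficient weight W = S_hat^{-1}.
e = y - Z @ delta_1
S_hat = (X * e[:, None] ** 2).T @ X / n
delta_2 = gmm(np.linalg.inv(S_hat))

print("first-step  (W = I):     ", delta_1)
print("second-step (W = S^-1):  ", delta_2)
```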

There may be some further interesting considerations in a multiple equation setup, in which misspecification in one equation can "pollute" the entire system. You can avoid that risk through a less efficient, but more robust block-diagonal weighting matrix, of which $I$ would be an example.
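As a small illustration of the block-diagonal idea (the number of equations, block sizes, and per-equation blocks below are made up purely for the example):

```python
# In a two-equation system with stacked moments (g1, g2), a block-diagonal
# weighting matrix ignores cross-equation covariances, so misspecification in
# one equation cannot contaminate the weights applied to the other.
import numpy as np
from scipy.linalg import block_diag

W1 = np.eye(3)          # weight block for the moments of equation 1
W2 = np.eye(2)          # weight block for the moments of equation 2
W = block_diag(W1, W2)  # the identity matrix is the simplest such choice
print(W)
```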

Christoph Hanck
  • May I ask a follow-up question here? When we set $W=I$, why is GMM inefficient? How can I prove this? Could you outline the proof? Thank you. – 1190 Jun 25 '21 at 14:23
  • I posted a second answer addressing this point (for any $W$, not just $I$). – Christoph Hanck Jun 25 '21 at 15:00

This second answer addresses the question posed in the comment on the first answer: why the specific choice $W=S^{-1}$ yields the efficient GMM estimator, so that any other $W$ (including $I$) is weakly less efficient.

The efficient weighting matrix results from the general one by setting $W=S^{-1}$, which gives the asymptotic variance
\begin{eqnarray*}
\mathrm{Avar}(\widehat{\delta}(\widehat{S}))&=&(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}\Sigma_{xz}'S^{-1}SS^{-1}\Sigma_{xz}(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}\\
&=&(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}\Sigma_{xz}'S^{-1}\Sigma_{xz}(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}\\
&=&(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}
\end{eqnarray*}

We therefore need to show that the difference between the general asymptotic variance and the one with the (to be shown) efficient weighting matrix is positive semi-definite:
$$(\Sigma_{xz}'W\Sigma_{xz})^{-1}\Sigma_{xz}'WSW\Sigma_{xz}(\Sigma_{xz}'W\Sigma_{xz})^{-1}-(\Sigma_{xz}'S^{-1}\Sigma_{xz})^{-1}\geqslant0$$

Linear algebra (Thm. 1.24 in Magnus/Neudecker 1988: for p.d. matrices, $A-B\geqslant0\Leftrightarrow B^{-1}-A^{-1}\geqslant0$, much like $3>2$ but $1/2>1/3$) tells us that this condition is equivalent to
$$Q:=\Sigma_{xz}'S^{-1}\Sigma_{xz}-\Sigma_{xz}'W\Sigma_{xz}(\Sigma_{xz}'WSW\Sigma_{xz})^{-1}\Sigma_{xz}'W\Sigma_{xz}\geqslant 0$$

As $S$ is p.d., $S^{-1}$ can be decomposed as $S^{-1}=C'C$. Further define $H=C\Sigma_{xz}$ and $G=C'^{-1}W\Sigma_{xz}$. Then,
\begin{eqnarray*}
Q&=&\Sigma_{xz}'C'C\Sigma_{xz}-\Sigma_{xz}'W\Sigma_{xz}(\Sigma_{xz}'WC^{-1}C'^{-1}W\Sigma_{xz})^{-1}\Sigma_{xz}'W\Sigma_{xz}\\
&=&H'H-\Sigma_{xz}'W\Sigma_{xz}(G'G)^{-1}\Sigma_{xz}'W\Sigma_{xz}\\
&=&H'H-\Sigma_{xz}'C'C'^{-1}W\Sigma_{xz}(G'G)^{-1}\Sigma_{xz}'WC^{-1}C\Sigma_{xz}\\
&=&H'H-H'G(G'G)^{-1}G'H\\
&=&H'(I-G(G'G)^{-1}G')H
\end{eqnarray*}

The matrix $I-G(G'G)^{-1}G'$ is, as usual, symmetric and idempotent and therefore p.s.d. Thus, for an arbitrary vector $a$ and $c:=Ha$,
\begin{eqnarray*}
a'Qa&=&a'H'(I-G(G'G)^{-1}G')Ha\\
&=&c'(I-G(G'G)^{-1}G')c\geqslant0
\end{eqnarray*}
so that $Q$ is indeed p.s.d.
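A quick numerical sanity check of this result (my own sketch; the random $\Sigma_{xz}$, $S$, and $W$ below are arbitrary placeholders): for any p.d. $W$, the eigenvalues of $\mathrm{Avar}(W)-\mathrm{Avar}(S^{-1})$ should be nonnegative up to rounding error.

```python
# Verify numerically that the general sandwich variance minus the efficient
# one is positive semi-definite for random Sigma_xz, S, and W.
import numpy as np

rng = np.random.default_rng(42)
k, p = 5, 3                                   # moments k >= parameters p
Sigma_xz = rng.normal(size=(k, p))

A = rng.normal(size=(k, k))
S = A @ A.T + k * np.eye(k)                   # a p.d. long-run variance matrix
B = rng.normal(size=(k, k))
W = B @ B.T + k * np.eye(k)                   # an arbitrary p.d. weighting matrix

def avar(W):
    """(Sigma'W Sigma)^{-1} Sigma'W S W Sigma (Sigma'W Sigma)^{-1}."""
    M = np.linalg.inv(Sigma_xz.T @ W @ Sigma_xz)
    return M @ Sigma_xz.T @ W @ S @ W @ Sigma_xz @ M

diff = avar(W) - avar(np.linalg.inv(S))
# all eigenvalues should be >= 0 (up to numerical error)
print("eigenvalues of Avar(W) - Avar(S^{-1}):", np.linalg.eigvalsh(diff))
```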

Christoph Hanck