41

Instrumental variables are becoming increasingly common in applied economics and statistics. For the uninitiated, can we have some non-technical answers to the following questions:

  1. What is an instrumental variable?
  2. When would one want to employ an instrumental variable?
  3. How does one find or choose an instrumental variable?
csgillespie
Graham Cookson
  • 4
    Don't you think that the Wikipedia article about it is enough? –  Jul 23 '10 at 17:59
  • 1
    Questions such as this require a wiki / blog post type of response. I do think questions should not require such long answers. –  Jul 23 '10 at 19:50
  • I'm not sure the right thing to do is to simply ignore this question and refer the asker to the wiki - especially during beta where we are trying to build up the content of the site. Perhaps the question asker should submit each of these questions individually so that they can be better addressed. – russellpierce Jul 23 '10 at 20:52
  • 3
    @mbq - the wikipedia example hardly qualifies as nontechnical. It's very reliant on jargon and equations. – rolando2 Aug 04 '11 at 23:39
  • 1
    It HAS become common in economics, some time in the 1980s. Some biostatisticians have heard of it, too, and apply it in the context of measurement error models, where instruments are narrowly thought of as additional available measurements. They qualify as instruments within the broader econometric context: they are correlated with the variable of interest, and they are uncorrelated with its measurement error. – StasK Aug 23 '11 at 03:08
  • Use of instrumental variables was already quite common when I first studied Econometrics in the late 70s. I suppose it may have become even more common since then. Every Econometric package and its uncle had instrumental variables capability, although some of even the more prominent of the packages incorrectly implemented instrumental variables, and did not perform what was advertised for certain cases - I know, I saw the insides, which in many cases did not match the documentation. – Mark L. Stone Jul 31 '15 at 19:36

4 Answers

49

[The following may seem a little technical because of the equations, but it builds mainly on the arrow charts to provide the intuition, which requires only a very basic understanding of OLS - so don't be put off.]

Suppose you want to estimate the causal effect of $x_i$ on $y_i$, given by the coefficient $\beta$, but for some reason there is a correlation between your explanatory variable and the error term:

$$\begin{matrix}y_i &=& \alpha &+& \beta x_i &+& \epsilon_i & \\ & && & & \hspace{-1cm}\nwarrow & \hspace{-0.8cm} \nearrow \\ & & & & & corr & \end{matrix}$$

This might have happened because we forgot to include an important variable that also correlates with $x_i$. This problem is known as omitted variable bias, and in that case your $\widehat{\beta}$ will not give you the causal effect (see here for the details). This is a case where you would want to use an instrument, because it lets you recover the causal effect even though the omitted variable is not in the regression.
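
To make the bias concrete, here is a minimal simulation sketch using only the Python standard library. Everything in it is an assumption chosen for illustration: the true effect `true_beta = 2.0`, the omitted variable `w`, its coefficient of 3, and the sample size. It regresses $y_i$ on $x_i$ while leaving `w` in the error term.

```python
import random

def slope(x, y):
    """OLS slope of y on x (intercept included): cov(x, y) / var(x)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)   # fixed seed so the sketch is repeatable
n = 20_000
true_beta = 2.0          # assumed causal effect of x on y

# w is the variable we "forget" to include; it drives both x and y
w = [rng.gauss(0, 1) for _ in range(n)]
x = [wi + rng.gauss(0, 1) for wi in w]
y = [1.0 + true_beta * xi + 3.0 * wi + rng.gauss(0, 1)
     for xi, wi in zip(x, w)]

beta_ols = slope(x, y)
print(f"true beta = {true_beta}, OLS estimate = {beta_ols:.2f}")
```

Under these assumed numbers the OLS slope should settle near $2 + 3 \cdot \text{cov}(x, w)/\text{var}(x) = 3.5$ rather than 2, and collecting more data does not shrink that gap.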

An instrument is a new variable $z_i$ which is uncorrelated with $\epsilon_i$, but that correlates well with $x_i$ and which only influences $y_i$ through $x_i$ - so our instrument is what is called "exogenous". It's like in this chart here:

$$\begin{matrix} z_i & \rightarrow & x_i & \rightarrow & y_i \\ & & \uparrow & \nearrow & \\ & & \epsilon_i & \end{matrix}$$

So how do we use this new variable?
Maybe you remember the ANOVA-type idea behind regression, where you split the total variation of a dependent variable into an explained and an unexplained component. For example, if you regress your $x_i$ on the instrument,

$$\underbrace{x_i}_{\text{total variation}} = \underbrace{a \quad + \quad \pi z_i}_{\text{explained variation}} \quad + \underbrace{\eta_i}_{\text{unexplained variation}}$$

then you know that the explained variation here is exogenous to our original equation because it depends only on the exogenous variable $z_i$. So in this sense, we split our $x_i$ into a part that we can claim is exogenous (that's the part that depends on $z_i$) and some unexplained part $\eta_i$ that keeps all the bad variation which correlates with $\epsilon_i$. Now we take the exogenous part of this regression, call it $\widehat{x}_i$,

$$x_i \quad = \underbrace{a \quad + \quad \pi z_i}_{\text{good variation} \: = \: \widehat{x}_i } \quad + \underbrace{\eta_i}_{\text{bad variation}}$$

and put this into our original regression: $$y_i = \alpha + \beta \widehat{x}_i + \epsilon_i$$

Now since $\widehat{x}_i$ is no longer correlated with $\epsilon_i$ (remember, we "filtered out" this part from $x_i$ and left it in $\eta_i$), we can consistently estimate our $\beta$ because the instrument has helped us to break the correlation between the explanatory variable and the error. This is one way to apply instrumental variables. This method is called two-stage least squares (2SLS), where our regression of $x_i$ on $z_i$ is called the "first stage" and the last equation here is called the "second stage".
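
The two stages can be sketched by hand in a few lines of standard-library Python. This is an illustration under assumed numbers (a true $\beta$ of 2, a first-stage coefficient of 0.8, an unobserved confounder `w`), not a substitute for a proper IV routine; the helper `ols` is written here just for the example.

```python
import random

def ols(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b

rng = random.Random(1)
n = 20_000
true_beta = 2.0                            # assumed causal effect

z = [rng.gauss(0, 1) for _ in range(n)]    # instrument, unrelated to w
w = [rng.gauss(0, 1) for _ in range(n)]    # unobserved, sits in the error
x = [0.8 * zi + wi + rng.gauss(0, 1) for zi, wi in zip(z, w)]
y = [1.0 + true_beta * xi + 3.0 * wi + rng.gauss(0, 1)
     for xi, wi in zip(x, w)]

# First stage: regress x on z and keep the fitted ("good") part
a_hat, pi_hat = ols(z, x)
x_hat = [a_hat + pi_hat * zi for zi in z]

# Second stage: regress y on the fitted values
alpha_hat, beta_2sls = ols(x_hat, y)

# Naive regression of y on x, for comparison
_, beta_ols = ols(x, y)
print(f"OLS: {beta_ols:.2f}   2SLS: {beta_2sls:.2f}   true: {true_beta}")
```

One caveat about doing the stages manually: the point estimate is the 2SLS estimate, but the standard errors a regression routine would report for the second stage are not valid, because they ignore that $\widehat{x}_i$ was itself estimated. Dedicated IV routines correct for this.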

In terms of our original picture (I leave out the $\epsilon_i$ to not make a mess, but remember that it is there!), instead of taking the direct but flawed route from $x_i$ to $y_i$ we took an intermediate step via $\widehat{x}_i$

$$\begin{matrix} & & & & & \widehat{x}_i \\ & & & & \nearrow & \downarrow \\ & z_i & \rightarrow & x_i & \rightarrow & y_i \end{matrix}$$

Thanks to this slight diversion of our road to the causal effect we were able to consistently estimate $\beta$ by using the instrument. The cost of this diversion is that instrumental variables models are generally less precise, meaning that they tend to have larger standard errors.
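
The precision cost can be seen in a small Monte Carlo sketch. It repeatedly draws samples from an assumed data-generating process (true $\beta$ of 2, confounder `w`, 500 observations per sample, 300 samples, all hypothetical choices) and compares the spread of the OLS and IV estimates. With a single instrument the 2SLS estimate equals the ratio $\text{cov}(z, y)/\text{cov}(z, x)$, which is what the code uses.

```python
import random
import statistics

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (len(a) - 1)

def one_sample(rng, n, beta):
    """Draw one hypothetical data set and return (OLS slope, IV slope)."""
    z = [rng.gauss(0, 1) for _ in range(n)]   # instrument
    w = [rng.gauss(0, 1) for _ in range(n)]   # unobserved confounder
    x = [0.8 * zi + wi + rng.gauss(0, 1) for zi, wi in zip(z, w)]
    y = [beta * xi + 3.0 * wi + rng.gauss(0, 1) for xi, wi in zip(x, w)]
    return cov(x, y) / cov(x, x), cov(z, y) / cov(z, x)

rng = random.Random(2)
true_beta = 2.0
draws = [one_sample(rng, n=500, beta=true_beta) for _ in range(300)]
ols_draws = [d[0] for d in draws]
iv_draws = [d[1] for d in draws]

print(f"OLS: mean {statistics.fmean(ols_draws):.2f}, "
      f"sd {statistics.stdev(ols_draws):.3f}")
print(f"IV:  mean {statistics.fmean(iv_draws):.2f}, "
      f"sd {statistics.stdev(iv_draws):.3f}")
```

In this setup the OLS estimates should cluster tightly around the wrong value, while the IV estimates should centre near the true $\beta$ with a visibly larger standard deviation. That is the trade-off described above.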

How do we find instruments?
That's not an easy question because you need to make a good case as to why your $z_i$ would not be correlated with $\epsilon_i$ - this cannot be tested formally because the true error is unobserved. The main challenge is therefore to come up with something that can plausibly be seen as exogenous, such as natural disasters or policy changes; sometimes you can even run a randomized experiment. The other answers had some very good examples for this so I won't repeat this part.

Andy
  • 10
    +1 I am grateful finally to read a detailed answer instead of a list of references or links. – whuber Jul 31 '15 at 19:06
  • 1
    Excellent! I explain this to my students more "mnemonically" as: $x$ is poisoned/tainted by unobserved factors in $\epsilon$. The first-stage regression "cleans"/sucks out the venom from $x$. We can use the "cleaned" version of $x$ to find the causal coefficient, $\beta$. – MichaelChirico Feb 17 '17 at 17:04
  • Is there an intuitive argument why the 2SLS estimate for $\beta$ is consistent? When we calculate $\widehat{x}_i$, we are "filtering out" the part of $x_i$ that is correlated with the error, but why should it be that the filtering out doesn't change $x_i$ in a way that changes our estimate for $\beta$? – user35734 Feb 22 '17 at 00:16
  • See here: http://stats.stackexchange.com/questions/64279/derivation-of-iv-estimator or you may want to ask a new question. Hope this helps. – Andy Feb 22 '17 at 09:19
  • @user35734 it's not consistent but *asymptotically* consistent. – Vim Nov 23 '18 at 11:31
  • Is it enough for $z_i$ to be uncorrelated with $\epsilon_i$? Or does it need to be independent? – Neil G Feb 01 '19 at 15:56
18

As a medical statistician with no previous knowledge of econom(etr)ics, I struggled to get to grips with instrumental variables: I often found the econometric examples hard to follow and didn't understand the rather different terminology (e.g. 'endogeneity', 'reduced form', 'structural equation', 'omitted variables'). Here are a few references I found useful (the first should be freely available, but I'm afraid the others probably require a subscription):

I'd also recommend chapter 4 of:

onestop
12

Here are some slides that I prepared for an econometrics course at UC Berkeley. I hope that you find them useful---I believe that they answer your questions and provide some examples.

There are also more advanced treatments on the course pages for PS 236 and PS 239 (graduate-level political science methods courses) at my website: http://gibbons.bio/teaching.html.

Charlie

7

Non-technical (usually that's all I'm good for anyway): There are times when not only does X cause Y, but Y causes X as well. An instrumental variable is a device that can "clean up" this messy, inconvenient relationship so that the best estimates can be made of X's effect on Y.

The instrumental variable is chosen by virtue of its relationships: it is a cause of X, but, other than acting through X, it has no effect on Y. The instrument (or instruments) is used in Stage One to compute a new "version" of X, one that is in no way a function of Y. This new "predicted" X is then used in a second stage, in a more standard regression, to explain/predict Y. Hence the term Two-Stage Least Squares regression.
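
As a rough illustration of the two-way case, the sketch below simulates a system in which X affects Y and Y feeds back into X. All numbers are assumptions made up for the example: `beta` is the effect of X on Y, `gamma` is the feedback of Y on X, and `z` is an instrument that shifts X only. Because there is a single instrument, the two-stage estimate reduces to the ratio cov(z, y) / cov(z, x).

```python
import random

def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / (n - 1)

rng = random.Random(3)
n = 20_000
beta, gamma = 0.5, 0.5   # assumed: effect of X on Y, feedback of Y on X

z = [rng.gauss(0, 1) for _ in range(n)]   # instrument: moves X only
u = [rng.gauss(0, 1) for _ in range(n)]   # other causes of Y
v = [rng.gauss(0, 1) for _ in range(n)]   # other causes of X

# Solve  y = beta*x + u  and  x = gamma*y + z + v  jointly for x, then y
x = [(zi + gamma * ui + vi) / (1 - gamma * beta)
     for zi, ui, vi in zip(z, u, v)]
y = [beta * xi + ui for xi, ui in zip(x, u)]

beta_ols = cov(x, y) / cov(x, x)   # contaminated by the feedback
beta_iv = cov(z, y) / cov(z, x)    # uses only variation in X that comes from z
print(f"OLS: {beta_ols:.2f}   IV: {beta_iv:.2f}   true: {beta}")
```

In this made-up system the plain regression should overstate X's effect, since part of the X-Y association runs from Y back to X, while the instrument-based estimate should land close to the assumed 0.5.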

One typically finds the IV in processes that are overriding or beyond the control of X OR Y, such as variables that depend on laws, policies, acts of nature, etc.

rolando2