When endogenous variable is non-normally distributed

Question

I am trying to fit an IV probit model with my dependent variable being binary. However, my endogenous variable is a "ratio" variable, which is not normally distributed, thereby preventing me from employing an IV Probit model. I am wondering whether employing LPM with 2SLS could help resolve the issue of non-normality of the endogenous variable?

dimitriy · Accepted Answer · 2020-07-16T15:47:08.820

2

The only assumption `ivprobit` requires of the endogenous variable is that it is continuous. You do not need normality for it, only for the errors in the two equations (which need to be multivariate normal and homoskedastic). You also can't leave anything out of the model. See here for more.

Ratios are usually continuous, so you should not have a problem. This restriction comes from the fact that ivprobit is a control function estimator.

edited Jul 16 '20 at 15:47

answered Jul 15 '20 at 06:26


Thanks for your comment, Dimitriy! In fact, my endogenous "ratio" variable has a considerable number of zeros, so I suppose even the error term in the first stage is not normally distributed. I think this is quite a big issue for employing IV estimation. – Zhenkai Ran Jul 15 '20 at 09:44

I don't think this is necessarily an issue. You cannot assume that the unconditional distribution of a variable tells you something about the error distribution. Here is [a nice example](https://stats.stackexchange.com/a/398553/7071). Moreover, the regression is a model for the conditional mean, and the CLT (under some assumptions) allows the mean to be asymptotically normal, even if the variable that goes into the mean is not. This is also the logic for why ordinary (non-probit) IV regression would work well in this case. – dimitriy Jul 15 '20 at 13:31
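The point about unconditional versus error distributions can be illustrated with a small simulation. This is only a sketch on made-up data: the names `z`, `x`, and `vhat`, the seed, and the coefficient 5 are all assumptions chosen for illustration.

```stata
* Simulated data (all names and values are illustrative assumptions)
clear
set seed 12345
set obs 1000

* A binary regressor shifts the mean of x, so x is bimodal overall
generate z = runiform() < 0.3
generate x = 5*z + rnormal()      // the error term itself is exactly normal

* Unconditional distribution of x: a mixture of two normals, far from normal
swilk x

* Residuals after conditioning on z: these estimate the normal error
regress x z
predict vhat, residuals
swilk vhat
```

Comparing the two `swilk` results shows why a non-normal looking variable does not by itself say much about the errors in its equation.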
I would certainly encourage you to fit both the MLE and min-$\chi^2$ versions of the estimator as well as ordinary IV models to see how sensitive results are. – dimitriy Jul 15 '20 at 13:32
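A minimal Stata sketch of that sensitivity check follows. The variable names are placeholders borrowed from the `cmp` example further down this thread (`y` binary outcome, `x` endogenous ratio, `n` and `k` exogenous controls, `z` excluded instrument), not the asker's actual variables, and the stored-estimate names are arbitrary.

```stata
* Placeholder names (assumptions): y binary outcome, x endogenous ratio,
* n and k exogenous controls, z excluded instrument.

* 1. IV probit by maximum likelihood (the default)
ivprobit y n k (x = z)
estimates store ivp_mle

* 2. IV probit by the minimum chi-squared two-step estimator
ivprobit y n k (x = z), twostep
estimates store ivp_2step

* 3. Linear probability model by 2SLS with robust standard errors
ivregress 2sls y n k (x = z), vce(robust)
estimates store lpm_2sls

* Coefficients and standard errors side by side
estimates table ivp_mle ivp_2step lpm_2sls, b(%9.3f) se
```

The three sets of coefficients are on different scales (the two probit versions use different normalizations, and the 2SLS coefficients are changes in probability), so it is probably more informative to compare signs, significance, and implied marginal effects than raw magnitudes.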
@ZhenkaiRan Did this clear things up? – dimitriy Jul 16 '20 at 02:57
Thanks for following up, Dimitriy! I am still wondering whether I can fit my first stage first, and then test the normality of the residual? Does this help justify my model setting of using IVprobit, or 2SLS? I just tried to fit the first stage, and found out that the residual is not normally distributed by using the "swilk" command in STATA. So, this indicates 2SLS or IVProbit is not suitable? – Zhenkai Ran Jul 16 '20 at 12:48
That normality is sufficient, but not necessary. To quote Wooldridge's [black book (p. 592)](https://mitpress.mit.edu/books/econometric-analysis-cross-section-and-panel-data-second-edition), "it is certainly possible for $D(u_1 \vert v_2)$ to be normal without $v_2$ having normal distribution." My recommendation is still to fit all possible models to see if they agree. If they don't, you can report the swilk p-value (or plots to show how bad the departure is), and say that the ivprobit models are not great because they make stronger assumptions that may not hold here, which favors the ordinary IV. – dimitriy Jul 16 '20 at 16:12
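For reference, the first-stage check discussed in these comments might look roughly like this in Stata. The variable names (`x`, `n`, `k`, `z`) are the placeholders from the `cmp` example below and `v2hat` is an arbitrary name for the residual, so treat all of them as assumptions.

```stata
* First stage: endogenous regressor on the exogenous controls and the instrument
* (x, n, k, z are placeholder names; v2hat is an arbitrary residual name)
regress x n k z
predict v2hat, residuals

* Formal test of residual normality
swilk v2hat

* Visual checks of how far the residuals depart from normality
qnorm v2hat
histogram v2hat, normal
```

The plots are worth reporting alongside the p-value, since a formal test can reject for departures that are small in practical terms.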
Note that the spelling is Stata since it is not an acronym. Also, what is the range of the ratio variable? Is it strictly $[0,1]$? – dimitriy Jul 16 '20 at 16:25
Thanks for your advice, Dimitriy! It is really helpful. The ratio is within the unit interval with an excessive amount of zeros. – Zhenkai Ran Jul 17 '20 at 00:42
Then you might also try David Roodman's `cmp` command, which allows you to have a fractional endogenous variable instead: `cmp (y = x n k) (x = n k z), ind($cmp_probit $cmp_frac)`. Just run `cmp setup` to define the globals beforehand. – dimitriy Jul 17 '20 at 00:48
Thanks very much for your advice, Dimitriy! I will have a look at the `cmp` command. This helps a lot! – Zhenkai Ran Jul 17 '20 at 01:18
