4

While using R for estimating the Heckman 2-stages model a friend stumbled on a problem. the selection() funtion from the package sampleSelection estimates everything just fine. The code is as follows.

# reading the data
dados <- read.dta("mus16data.dta")
# response variable
y <- dados$ambexp
# the transformation of the response that we mean to model
lny <- log(y)
# the variable that will be the response in the probit stage
dy <- y > 0
# the actual model
model <- selection(dy ~ ins + age + female + educ + blhisp + totchr,
  lny ~ age + female + educ + blhisp + totchr + age2, dados)

The estimated covariance matrix for the coefficient estimators is found by calling model$vcov. It agrees with the output from Stata, for example. But when we try to construct the output by "hand" (coding from scratch), using the formulas from the documentation of both R and Stata (which are the same) we do not arrive at the same result as the software. Any hints on what may be wrong? The expressions are actually simple and are seen on Greene's Econometric Analysis in the chapter regarding the models.

Achim Zeileis
  • 13,510
  • 1
  • 29
  • 53
Gustavo
  • 43
  • 1
  • 3

1 Answers1

3

The default estimation in procedure in selection() in R (and hence presumably also in Stata) is full maximum likelihood (ML). Originally, Heckman proposed to use a 2-step estimator which is still discussed in many textbooks but which is less efficient than the ML estimator. In the 70s the ease of computation of the 2-step estimator with standard software was still a major advantage but modern software can easily compute the efficient ML estimator.

In selection() you can set method = "2step" (rather than "ml") to obtain the 2-step estimator.

The underlying workhorse functions are tobit2fit() (for the ML case) and heckit2fit() (for the 2-step case). The former obtains the estimated covariance matrix from the Hessian (via the maxLik package). The latter obtains the covariance matrix from the formulas described in Greene (2003, p.784-785). There is a dedicated function heckitVcov() in the package with reasonably detailed manual page. The corresponding code is also straightforward to follow when comparing it to the equations in Greene's textbook.

Achim Zeileis
  • 13,510
  • 1
  • 29
  • 53
  • Oh, I'm sorry. I actually used method = "2step" and verified that model$method returns "2step". But the return for the estimated covariance matrix does not match the one we obtain from using probit and lm() and the formula in the documentation. Do you think R returns the efficient estimator for the covariance matrix by default? I mean, there is no reason to use the other version, right? (even though it is a consistent estimator) – Gustavo May 17 '15 at 15:13
  • This is fairly well documented in `?heckitVcov` and the corresponding source code is also straightforward. I've updated my reply to contain some more details about the covariance matrix estimation. – Achim Zeileis May 18 '15 at 10:08