Exact difference between two-part models (e.g., Cragg) and Tobit type 2 models (e.g., Heckman)

Question

I want to run a regression where the dependent variable (DV) is the amount of funding (in USD) obtained by startups. The DV contains many zeros (~55%) and has a continuous distribution for y > 0.

My general understanding is that the Tobit model (or a variation of it) is appropriate for modelling this DV.

Although I have been reading about and discussing this for months, I still struggle to get my head around the exact difference between the standard Tobit model (Tobin, 1958), the two-part extensions proposed by Cragg (1971) and the Tobit Type 2 model, e.g., as represented by Heckman (1974, 1976, 1979). My current understanding is that all of these models could in theory be applicable, each with its own pros and cons, and that there may be reasons not to use some of them at all (depending on the exact characteristics of the data set).

Why I excluded the standard Tobit model

For my application I excluded the standard Tobit model because it forces both processes (whether y is positive, and how large y is when it is positive) to be governed by the same variables, with only one coefficient reported per variable. Hence, the effect of a given variable cannot have a different sign in the selection and outcome equations, although this does sometimes happen in practice.
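To make this restriction concrete, here is the standard Tobit model written out. The notation is my own shorthand rather than taken from any of the cited papers: x_i are the covariates, β the coefficients, σ the error standard deviation, and Φ and φ the standard normal cdf and pdf.

```latex
\begin{aligned}
& y_i^* = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2), \qquad y_i = \max(0,\, y_i^*) \\
& \Pr(y_i > 0 \mid x_i) = \Phi\!\left(\frac{x_i'\beta}{\sigma}\right) \\
& E[y_i \mid y_i > 0,\, x_i] = x_i'\beta + \sigma\,\frac{\phi(x_i'\beta/\sigma)}{\Phi(x_i'\beta/\sigma)}
\end{aligned}
```

Both the probability of a positive outcome and the expected amount given a positive outcome are increasing functions of the same index x_i'β, so a variable with a positive coefficient raises both at once.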

Tobit Type 2 (or Heckman selection model) vs. two-part model (Cragg)

My understanding so far is that the key difference between the two models is that two-part models assume only true zeros, whereas Tobit Type 2 also (or only?) accounts for unobserved zeros (e.g., people who generally do not smoke are a 0, and people who generally smoke but at a certain point in time cannot afford to smoke are a 0 as well).

This, however, is not entirely true as Cragg (1971) originally also proposed a double hurdle model where 2 hurdles have to be overcome before positive values of y are observed: "First, a positive amount has to be desired [(i.e., I am a smoker or not)]. Second, favourable circumstances have to arise for the positive desire to be carried out [(i.e., I am a smoker and I have sufficient funds to afford smoking)]".

I think this means the following. The Tobit Type 2 model accounts for both types of zeros (or only unobserved ones?) in the selection equation, and its outcome equation is truncated at y > 0. The single-hurdle Cragg model accounts only for true zeros in the selection equation. The double-hurdle Cragg model accounts for 'unobserved' zeros in the selection equation and 'true' zeros in the outcome equation.
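For reference, one common way to write the three models in a shared latent-variable notation is sketched below. The symbols are my own assumptions for illustration (d_i^* is the latent selection or participation index with covariates z_i, y_i^* the latent outcome with covariates x_i, ρ the error correlation); this is not a quotation from Cragg or Heckman.

```latex
\begin{aligned}
& d_i^* = z_i'\gamma + u_i, \qquad y_i^* = x_i'\beta + \varepsilon_i, \qquad
  \operatorname{Var}(u_i) = 1, \quad \operatorname{Var}(\varepsilon_i) = \sigma^2, \quad
  \operatorname{Corr}(u_i, \varepsilon_i) = \rho \\[4pt]
& \text{Two-part (single hurdle):} \quad \Pr(y_i > 0) = \Phi(z_i'\gamma), \qquad
  f(y_i \mid y_i > 0) = \frac{\sigma^{-1}\,\phi\big((y_i - x_i'\beta)/\sigma\big)}{\Phi(x_i'\beta/\sigma)} \\[4pt]
& \text{Tobit type 2:} \quad y_i = y_i^* \ \text{if } d_i^* > 0, \qquad
  y_i \ \text{not observed (or recorded as 0) if } d_i^* \le 0 \\[4pt]
& \text{Double hurdle:} \quad y_i = y_i^* \ \text{if } d_i^* > 0 \ \text{and } y_i^* > 0, \qquad
  y_i = 0 \ \text{otherwise}
\end{aligned}
```

In the double hurdle with independent errors (ρ = 0), the probability of a positive outcome is Φ(z_i'γ)Φ(x_i'β/σ), so a zero can come from either hurdle. The two-part line above uses the truncated normal form for the positive part; other choices (e.g., lognormal) are also used.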

Questions

Is my statement about the three models correct? And what exactly does this mean? Are the sources of zeros the only/main decision criterion? If so, this would mean the following for my data: startups decide whether or not to apply for funding (first source of zeros -> unobserved), subsequently the market decides whether or not to supply funding (second source of zeros -> observed) and, in the positive case, how much (y > 0) -> Cragg's double hurdle model (the real 'double' hurdle model, which is often confused with the single hurdle model).
Irrespective of my (potentially wrong) conclusion: what are the key decision criteria I should consider/discuss when deciding which type of model to use (the Tobit Type 2 (Heckman) model or a two-part model, either single hurdle (all zeros are true zeros) or double hurdle (zeros can arise at selection and at consumption))? Is there more than 'just' the source of the zeros?

Additional information

This paper (which is a great read! Brad R. Humphreys, 2013) and especially one of its key graphics highlights the difference between unobserved zeros (i.e., missing data, companies not looking for funding) and observed zeros (i.e., investors providing funding or not) very well. It also provides guidance on which models to use, but unfortunately, does not provide a solution for data where both types of zeros are present at the same time.

Potential solution

After digging deeper I found two papers that offer a statistical solution to exactly what I am looking for:

Blundell, Richard and Meghir, Costas, (1987), Bivariate alternatives to the Tobit model, Journal of Econometrics, 34, issue 1-2, p. 179-200. (http://sites.psu.edu/scottcolby/wp-content/uploads/sites/13885/2014/07/Blundell1987_Bivariate-alternatives-to-the-tobit-model.pdf) describe a double hurdle model that assumes dependence. For an application see Blundell, Richard, Ham, John and Meghir, Costas, (1987), Unemployment and Female Labour Supply, Economic Journal, 97, issue 388a, p. 44-64.
Another solution is offered by Moulton, Lawrence H., and Neal A. Halsey. “A Mixture Model with Detection Limits for Regression Analyses of Antibody Response to Vaccine.” Biometrics, vol. 51, no. 4, 1995, pp. 1570–1578. www.jstor.org/stable/2533289 who describe a Bernoulli/Lognormal Mixture Model for Censored Data that also accounts for both types of zeros.

Unfortunately I couldn't find any trustworthy implementation in Stata or R (there is a package called mhurdle, but it appears not to be working well with weights and throwing random errors...)

Any comments or further ideas?

Did you figure out an answer on your own? If so, please answer below—I'm interested in the answers to the questions you posed. — Mark White, Jan 03 '18 at 04:50

score 4 · Answer 1 · answered Jan 05 '18 at 14:09

Thanks for asking, Mark. In the context of my data I ended up using the double hurdle model proposed by Blundell (the first bullet of my suggested solutions). Based on the feedback I received at academic conferences, this seems to be a viable approach. I also ended up using the R package mhurdle. Weights simply do not work, but the rest of the code seems to be very solid.

Regarding my specific questions: I do not have a definitive answer to all of them, but let me try to summarise what I learnt:

Is my statement about the three models correct? It appears so - yes

Are the sources of zeros the only/main decision criteria? They are certainly not the only decision criteria, but in context of data with a mass point at zero, spending significant time on understanding how the zeros are generated is tremendously important.

What are the key decision criteria I should consider/discuss when deciding about what type of model to use? Besides the obvious questions regarding the type of dependent variable and its distribution, the two main questions for data with a mass point at zero are: (1) Do you want to distinguish your results by the two stages, or is it sufficient to report one set of coefficients? If one set is sufficient, you may use a Tobit model; otherwise you need a two-part model. (2) In the latter case, how are the zeros generated? This is where the discussion about the different sources of zeros comes into play.
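As a rough illustration of how these models relate, here is a minimal Python sketch of per-observation log-likelihoods for the Tobit, Cragg's truncated-normal two-part model, and the independent double hurdle. The function names, the scalar indices `xb` (outcome index) and `zg` (participation index), and the toy sample are all hypothetical and chosen only for this example; this is not how mhurdle is implemented.

```python
import math

def norm_pdf(v):
    """Standard normal density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def norm_cdf(v):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def loglik_tobit(y, xb, sigma):
    # One index xb drives both the zero probability and the amount.
    if y <= 0:
        return math.log(1.0 - norm_cdf(xb / sigma))
    return math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma)

def loglik_two_part(y, zg, xb, sigma):
    # Single hurdle: a probit on zg decides zero vs. positive,
    # a normal truncated at zero describes the positive amounts.
    if y <= 0:
        return math.log(1.0 - norm_cdf(zg))
    return (math.log(norm_cdf(zg))
            + math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma)
            - math.log(norm_cdf(xb / sigma)))

def loglik_double_hurdle(y, zg, xb, sigma):
    # Independent double hurdle: a zero arises if either hurdle fails.
    p_positive = norm_cdf(zg) * norm_cdf(xb / sigma)
    if y <= 0:
        return math.log(1.0 - p_positive)
    return (math.log(norm_cdf(zg))
            + math.log(norm_pdf((y - xb) / sigma)) - math.log(sigma))

# Toy data (hypothetical): two zeros and two positive amounts.
sample = [0.0, 0.0, 1.3, 4.2]
xb, sigma = 1.0, 2.0

tobit = sum(loglik_tobit(y, xb, sigma) for y in sample)
# Two-part model restricted so that the probit index equals xb / sigma.
two_part_restricted = sum(loglik_two_part(y, xb / sigma, xb, sigma) for y in sample)
# Double hurdle with a first hurdle that is (almost) always passed.
double_hurdle_restricted = sum(loglik_double_hurdle(y, 8.0, xb, sigma) for y in sample)
```

Under the stated restrictions both richer models give the same log-likelihood as the Tobit, which is why the Tobit can be viewed as a special case and why the choice between one and two sets of coefficients can, in principle, be examined with a likelihood-based comparison.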

Is there more than 'just' the source of the zeros? Yep - there is. At least two types of zeros need to be distinguished: observed/true zeros and unobserved/false zeros (the latter actually being either NAs or values so small that they are recoded as 0).

Hope this helps you a bit! Jan

Apologies for responding to an old post. I am facing a similar dilemma in choosing/justifying the appropriate model. I'm looking at determinants of private paid tutoring expenditure. Naturally, those who don't take tutoring have zero expenditure. I'm currently using Tobit and hurdle models, as these are the commonly used techniques in the literature. Would you say this is appropriate? Further, could you shed some light on how (if at all) Heckman is different from hurdle? Your advice will be very valuable! Thank you! — PGupta, Dec 06 '20 at 20:31
Sorry for my late reply: Heckman and hurdle are very different models. Heckman corrects for a non-randomly selected sample (e.g., when analysing wages you only observe people who actually work, so you need a Heckman model with a first stage in which you estimate the likelihood of people participating in the workforce) -> unobserved zeros. Hurdle: you observe the 0s (e.g., the number of goals a football team scores), but want or need to model the likelihood of scoring at all separately from the number of goals scored once the team does score. It is all about where the 0s come from and whether you actively observe them or not. — JNWHH, May 31 '21 at 09:13
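The selection problem described in the comment above can be illustrated with a small simulation. This is only a sketch under assumed values: the correlation `rho = 0.8`, the latent mean of 1.0 and the selection rule `u > 0` are hypothetical choices, not estimates from any data set.

```python
import math
import random

random.seed(0)
rho, n = 0.8, 20000  # assumed error correlation and sample size

population, observed = [], []
for _ in range(n):
    u = random.gauss(0.0, 1.0)  # selection error
    # Outcome error with unit variance, correlated with u by rho.
    eps = rho * u + math.sqrt(1.0 - rho ** 2) * random.gauss(0.0, 1.0)
    y_star = 1.0 + eps          # latent outcome, e.g. an offered wage
    population.append(y_star)
    if u > 0:                   # outcome is recorded only for the selected
        observed.append(y_star)

pop_mean = sum(population) / len(population)
obs_mean = sum(observed) / len(observed)
selection_gap = obs_mean - pop_mean  # bias from ignoring selection
```

With these assumptions the gap is about rho * φ(0)/Φ(0) ≈ 0.64 in expectation: the selected sample overstates the population mean because the unselected outcomes are missing, not zero. In a hurdle setting the zeros are real recorded outcomes, so there is no missing part of the sample to correct for.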
