
I have a dependent variable that can range from 0 to infinity, with 0s actually being correct observations. I understand that censoring and Tobit models apply only when the actual value of $Y$ is partially unknown or missing, in which case the data are said to be censored (or, when such observations are dropped entirely, truncated). There is more information on censored data in this thread.

But here 0 is a true value that belongs to the population. Running OLS on these data has the particularly annoying problem of producing negative fitted values. How should I model $Y$?

> summary(data$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    0.00    0.00    7.66    5.20  193.00 
> summary(predict(m))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -4.46    2.01    4.10    7.66    7.82  240.00 
> sum(predict(m) < 0) / length(data$Y)
[1] 0.0972098

Developments

After reading the answers, I'm reporting the fit of a Gamma hurdle model using three slightly different prediction functions. The results are quite surprising to me. First, let's look at the DV. What is apparent is how extremely fat-tailed the data are. This has some interesting consequences for the evaluation of the fit, which I comment on below:

[Plot of the distribution of Y, showing the extreme right tail]

quantile(d$Y, probs=seq(0, 1, 0.1))
        0%        10%        20%        30%        40%        50%        60%        70%        80%        90%       100% 
  0.000000   0.000000   0.000000   0.000000   0.000000   0.000000   0.286533   3.566165  11.764706  27.286630 198.184818 

I built the Gamma hurdle model as follows:

# Stage 1: logistic regression for Pr(Y > 0)
d$zero_one = (d$Y > 0)
logit = glm(zero_one ~ X1*log(X2) + X1*X3, data=d, family=binomial(link = logit))
# Stage 2: Gamma GLM with a log link, fit on the positive observations only
gamma = glm(Y ~ X1*log(X2) + X1*X3, data=subset(d, Y>0), family=Gamma(link = log))

Finally, I evaluated the in-sample fit using three different techniques:

# logit probability * gamma estimate
predict1 = function(m_logit, m_gamma, data)
{
  prob = predict(m_logit, newdata=data, type="response")
  Yhat = predict(m_gamma, newdata=data, type="response")
  return(prob*Yhat)
}

# if logit probability < 0.5 then 0, else logit prob * gamma estimate 
predict2 = function(m_logit, m_gamma, data)
{
  prob = predict(m_logit, newdata=data, type="response")
  Yhat = predict(m_gamma, newdata=data, type="response")
  return(ifelse(prob<0.5, 0, prob)*Yhat)
}

# if logit probability < 0.5 then 0, else gamma estimate
predict3 = function(m_logit, m_gamma, data)
{
  prob = predict(m_logit, newdata=data, type="response")
  Yhat = predict(m_gamma, newdata=data, type="response")
  return(ifelse(prob<0.5, 0, Yhat))
}

At first I was evaluating the fit by the usual measures: AIC, null deviance, mean absolute error, etc. But looking at the quantiles of the absolute errors of the above functions highlights some issues related to the high probability of a 0 outcome and to the extreme fat tail of $Y$. (Note that predict1 is the model's unconditional expectation, $E[Y \mid X] = \Pr(Y > 0 \mid X) \cdot E[Y \mid Y > 0, X]$.) Of course the error grows sharply at higher values of $Y$ (there is also a very large $Y$ value at the max), but what is more interesting is that relying heavily on the logit model to estimate the 0s produces a better distributional fit (I don't know how to describe this phenomenon better):

quantile(abs(d$Y - predict1(logit, gamma, d)), probs=seq(0, 1, 0.1))
           0%           10%           20%           30%           40%           50%           60%           70%           80%           90%          100% 
   0.00320459    1.45525439    2.15327192    2.72230527    3.28279766    4.07428682    5.36259988    7.82389110   12.46936416   22.90710769 1015.46203281 
quantile(abs(d$Y - predict2(logit, gamma, d)), probs=seq(0, 1, 0.1))
         0%         10%         20%         30%         40%         50%         60%         70%         80%         90%        100% 
   0.000000    0.000000    0.000000    0.000000    0.000000    0.309598    3.903533    8.195128   13.260107   24.691358 1015.462033 
quantile(abs(d$Y - predict3(logit, gamma, d)), probs=seq(0, 1, 0.1))
         0%         10%         20%         30%         40%         50%         60%         70%         80%         90%        100% 
   0.000000    0.000000    0.000000    0.000000    0.000000    0.307692    3.557285    9.039548   16.036379   28.863912 1169.321773 
Robert Kubrick
  • Is the variable continuous other than at 0? If so, a zero-inflated model (e.g. zero-inflated gamma, zero-inflated lognormal, etc.) might be used – Glen_b Mar 17 '15 at 15:10
  • Yes. It's not a probit model for sure. I'm a bit hesitant about zero-inflated models, because the name seems to suggest 0s were over-reported and the DV distribution "has a problem", so to speak, whereas in my case all the 0 values are correct. – Robert Kubrick Mar 17 '15 at 15:11
  • And how about removing the intercept (I know, I know... but here the origin is truly 0)? – Robert Kubrick Mar 17 '15 at 15:15
  • Can you give us some insight as to the data generating mechanism? For instance, compound Poisson distributions can handle this type of data, but really they are designed for modelling the sum of some statistic applied to a collection of random events (e.g. the sum of the costs of insurance claims on a policy). – jlimahaverford Mar 17 '15 at 15:17
  • @jlimahaverford Y is the distance from target in meters, so it cannot be 0. – Robert Kubrick Mar 17 '15 at 15:22
  • @RobertKubrick is that a typo? Do you mean it _can_ be zero? – shadowtalker Mar 17 '15 at 15:43
  • @ssdecontrol sorry I meant cannot be lower than 0. – Robert Kubrick Mar 17 '15 at 15:45
  • @RobertKubrick a zero-inflated model would also make sense if you can postulate a two-step data generating process, where "zero or not zero" gets decided first, and then "how far, given not zero" gets decided after. – shadowtalker Mar 17 '15 at 15:52
  • This may be a strange comment, but how are we hitting the target so much? Is it big? If the target is a point in space then we would expect distance from target to always be strictly greater than zero. Consider the problem where the target is a circle of radius $r$. I would be inclined to add $r$ to all my positive measurements and for my zeros sample (probably uniformly) from a circle of radius $r$. This would give you positive values (almost surely) while not departing seriously from the true nature of the data generating mechanism. – jlimahaverford Mar 17 '15 at 17:32
  • Robert, some concrete examples of zero-inflated models to take it out of abstraction-land (and to help explain why they are not really about data "having a problem"): (1) how many cigarettes one smoked per day in the past 30 days may require a zero-inflated model, since *one must be in the habit of smoking cigarettes* for the number to be >0. (2) How many fish one caught on this boating trip requires that one *went fishing* for the number to be >0. – Alexis Mar 17 '15 at 20:50
  • @Glen_b Please see my added gamma hurdle model results. – Robert Kubrick Mar 19 '15 at 13:11

2 Answers


Censored vs. inflated vs. hurdle

Censored, hurdle, and inflated models work by adding a point mass on top of an existing probability density. The difference lies in where the mass is added, and how. For now, just consider adding a point mass at 0, but the concept generalizes easily to other cases.

All of them imply a two-step data generating process for some variable $Y$:

  1. Draw to determine whether $Y = 0$ or $Y > 0$.
  2. If $Y > 0$, draw to determine the value of $Y$.

Inflated and hurdle models

Both inflated (usually zero-inflated) and hurdle models work by explicitly and separately specifying $\operatorname{Pr}(Y = 0) = \pi$, so that the DGP becomes:

  1. Draw once from $Z \sim \operatorname{Bernoulli}(1 - \pi)$ to obtain realization $z$.
  2. If $z = 0$, set $y = z = 0$.
  3. If $z = 1$, draw once from $Y^* \sim D^*(\theta^*)$ and set $y = y^*$.

In an inflated model, $\operatorname{Pr}(Y^* = 0) > 0$. In a hurdle model, $\operatorname{Pr}(Y^* = 0) = 0$. That's the only difference.
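
To make this concrete, here is a minimal simulation sketch (illustrative only, not from the original post; the parameter values are arbitrary). Because a Gamma $D^*$ puts no mass at 0, the simulated variable follows a hurdle model by construction:

set.seed(42)
n   = 10000
pi0 = 0.6                                    # pi = Pr(Y = 0)

z     = rbinom(n, size = 1, prob = 1 - pi0)  # step 1: z = 1 means Y > 0
ystar = rgamma(n, shape = 2, rate = 0.25)    # D* = Gamma(shape = 2, rate = 0.25)
y     = ifelse(z == 1, ystar, 0)             # steps 2 and 3

mean(y == 0)    # close to pi0: the point mass at zero
mean(y[y > 0])  # close to shape/rate = 8: the mean of D*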

Both of these models lead to a density with the following form: $$ f_D(y) = \mathbb{I}(y = 0) \cdot \operatorname{Pr}(Y = 0) + \mathbb{I}(y > 0) \cdot \operatorname{Pr}(Y > 0) \cdot f_{D^*}(y) $$

where $\mathbb{I}$ is an indicator function. That is, a point mass is simply added at zero, and in this case that mass is simply $\operatorname{Pr}(Z = 0) = \pi$. You are free to estimate $\pi$ directly, or to set $g(\pi) = X\beta$ for some invertible $g$ like the logit function. $D^*$ can also depend on $X\beta$. In that case, the model works by "layering" a logistic regression for $Z$ under another regression model for $Y^*$.
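
As a sketch of why fitting the two GLMs separately (as in the question) is legitimate, here is the hurdle log-likelihood written out in R, assuming a Gamma $D^*$ parameterized by mean mu and shape; the Bernoulli piece and the Gamma piece share no parameters, so they can be maximized independently:

# Hurdle log-likelihood with constant parameters, to expose the factorization
# (in a regression, pi_hat and mu would depend on X through the two GLMs)
hurdle_loglik = function(y, pi_hat, mu, shape)
{
  # Bernoulli piece: zero vs. positive
  ll_zero = sum(ifelse(y == 0, log(pi_hat), log(1 - pi_hat)))
  # Gamma piece: density of the positive observations, with mean mu
  ll_pos = sum(dgamma(y[y > 0], shape = shape, rate = shape / mu, log = TRUE))
  ll_zero + ll_pos  # additive, so the two pieces can be maximized separately
}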

Censored models

Censored models also add mass at a boundary. They accomplish this by "cutting off" a probability distribution, and then "bunching up" the excess at that boundary. The easiest way to conceptualize these models is in terms of a latent variable $Y^* \sim D^*$ with CDF $F_{D^*}$. Then $\operatorname{Pr}(Y^* \leq y^*) = F_{D^*}(y^*)$. This is a very general model; regression is the special case in which $F_{D^*}$ depends on $X\beta$.

The observed $Y$ is then assumed to be related to $Y^*$ by: $$ Y = \begin{cases} 0 & Y^* \leq 0 \\ Y^* & Y^* > 0 \end{cases} $$

This implies a density of the form $$ f_D(y) = \mathbb{I}(y = 0) \cdot F_{D^*}(0) + \mathbb{I}(y > 0) \cdot f_{D^*}(y) $$ Equivalently, writing $f_{D^* \mid Y^* > 0}(y) = f_{D^*}(y) / \left(1 - F_{D^*}(0)\right)$ for the density of $Y^*$ conditional on $Y^* > 0$: $$ f_D(y) = \mathbb{I}(y = 0) \cdot F_{D^*}(0) + \mathbb{I}(y > 0) \cdot \left(1 - F_{D^*}(0)\right) \cdot f_{D^* \mid Y^* > 0}(y) $$

and can be easily extended.
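
For contrast, here is a minimal censored-Gaussian (Tobit) sketch, again illustrative rather than taken from the question; it assumes the AER package for tobit():

library(AER)  # tobit() is a convenience wrapper around survival::survreg()

set.seed(42)
n     = 10000
x     = rnorm(n)
ystar = 1 + 2*x + rnorm(n, sd = 2)  # latent Y* ~ D*, here Gaussian
y     = pmax(0, ystar)              # left-censoring: excess mass piles up at 0

mean(y == 0)                        # the point mass F_{D*}(0), averaged over x
tobit_fit = tobit(y ~ x, left = 0)  # recovers beta and sigma of the latent model
summary(tobit_fit)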

Putting it together

Look at the densities: $$\begin{align} f_D(y) &= \mathbb{I}(y = 0) \cdot \pi && + \mathbb{I}(y > 0) \cdot \left(1 - \pi\right) \cdot f_{D^*}(y) \\ f_D(y) &= \mathbb{I}(y = 0) \cdot F_{D^*}(0) && + \mathbb{I}(y > 0) \cdot \left(1 - F_{D^*}(0)\right) \cdot f_{D^* \mid Y^* > 0}(y) \end{align}$$

and notice that they both have the same form: $$ \mathbb{I}(y = 0) \cdot \delta + \mathbb{I}(y > 0) \cdot \left(1 - \delta\right) \cdot f_{+}(y) $$ where $f_{+}$ is a density supported on the positive values ($f_{D^*}$ in the hurdle case, $f_{D^* \mid Y^* > 0}$ in the censored case),

because they accomplish the same goal: building the density for $Y$ by adding a point mass $\delta$ to the density for some $Y^*$. The inflated/hurdle model sets $\delta$ by way of an external Bernoulli process. The censored model determines $\delta$ by "cutting off" $Y^*$ at a boundary, and then "clumping" the left-over mass at that boundary.

In fact, you can always postulate a hurdle model that "looks like" a censored model. Consider a hurdle model where $D^*$ is parameterized by $\mu = X\beta$ and $Z$ is parameterized by $g(\pi) = X\beta$. Then you can just set $g = F_{D^*}^{-1}$. An inverse CDF is always a valid link function in logistic regression, and indeed one reason logistic regression is called "logistic" is that the standard logit link is actually the inverse CDF of the standard logistic distribution.
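
As a small illustration using the question's own variables (a sketch, not from the original post): fitting the zero part with a probit link makes $g^{-1}$ the standard normal CDF, which is the same $F_{D^*}$ that a censored Gaussian model uses to set its mass at zero:

# Bernoulli GLM whose inverse link is the standard normal CDF, Phi
probit = glm(zero_one ~ X1*log(X2) + X1*X3, data = d,
             family = binomial(link = "probit"))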

You can come full circle on this idea, as well: Bernoulli regression models with any inverse CDF link (like the logit or probit) can be conceptualized as latent variable models with a threshold for observing 1 or 0. Censored regression is a special case of hurdle regression where the implied latent variable $Z^*$ is the same as $Y^*$.

Which one should you use?

If you have a compelling "censoring story," use a censored model. One classic usage of the Tobit model -- the econometric name for censored Gaussian linear regression -- is for modeling survey responses that are "top-coded." Wages are often reported this way, where all wages above some cutoff, say 100,000, are just coded as 100,000. This is not the same thing as truncation, where individuals with wages above 100,000 are not observed at all. This might occur in a survey that is only administered to individuals with wages under 100,000.

Another use for censoring, as described by whuber in the comments, is when you are taking measurements with an instrument that has limited precision. Suppose your distance-measuring device could not tell the difference between 0 and $\epsilon$. Then you could censor your distribution at $\epsilon$.

Otherwise, a hurdle or inflated model is a safe choice. It usually isn't wrong to hypothesize a general two-step data generating process, and it can offer some insight into your data that you might not have had otherwise.

On the other hand, you can use a censored model without a censoring story to create the same effect as a hurdle model without having to specify a separate "on/off" process. This is the approach of Sigrist and Stahel (2010), who censor a shifted gamma distribution just as a way to model data in $[0, 1]$. That paper is particularly interesting because it demonstrates how modular these models are: you can actually zero-inflate a censored model (section 3.3), or you can extend the "latent variable story" to several overlapping latent variables (section 3.1).

Truncation

Edit: removed, because this solution was incorrect

shadowtalker
  • Truncation is not correct; you indeed want censoring. Truncation of a continuous distribution leads to *zero* probability that an endpoint is realized. Censoring of a distribution at zero causes the probability of all negative values to be accumulated at zero, creating a discrete atom there: "zero inflation." As a matter of fact, the comments suggest the censoring actually occurs at a tiny positive value $\epsilon$ and that all values less than $\epsilon$ are recorded as zeros: that is censoring, pure and simple. – whuber Mar 17 '15 at 20:03
  • @whuber I just never thought about that before. For a r.v. $X$ that is truncated below $c$, $\operatorname{Pr}(X \leq c) = 0$, right? That, and the half-Cauchy or half-t distributions are used for _positive_ scale parameters, anyway. Now I'm curious about the difference between censored distributions and boundary-inflated truncated distributions... – shadowtalker Mar 17 '15 at 20:14
  • Mathematically I see no difference. Conceptually I would be willing to acknowledge that there may be a subtle difference, in that the censored distribution suggests to us what the distribution of a hypothetical uncensored result would be, whereas sticking an atom at the boundary of a truncated distribution makes no suppositions about that hypothetical distribution. There can be a mathematical (*i.e.* real) difference in more complex censoring situations, such as when censoring limits vary. – whuber Mar 17 '15 at 20:18
  • @whuber it just comes down to how the atom is parameterized, right? As in, a zero-censored Gaussian would be equivalent to a zero-inflated, truncated Gaussian with a Probit "link function" – shadowtalker Mar 17 '15 at 20:22
  • I think so. I realize my earlier comments were conceptualized in a non-regression framework. For regression the truncation/censoring limit *must* be allowed to vary. For a parent distribution $F$ and left-censoring limit $0$, the likelihood of $y$ given covariates $X$ and parameters $\beta$ is $dF(y-X\beta)$ for $y\gt 0$ and otherwise is $F(-X\beta)$, which obviously varies since $X$ must vary. I presume your "truncated" models operate similarly. I'm not sure what role a link function would play here. – whuber Mar 17 '15 at 20:31
  • @whuber are you sure? Wouldn't it just be $F_\theta(y)$ where $\theta$ is some parameter that depends on $X\beta$? Then the mass at the censoring boundary (let's say zero) would be $F_\theta(0)$. That's what I know as the Tobit-I model, and it wouldn't make sense in most of the typical econometrics applications if the boundary itself were forced to vary. – shadowtalker Mar 17 '15 at 20:42
  • That sounds like a more complicated model (in general) than the simple censored regression one I described. I am not suggesting the boundary for $y$ will vary: however, relative to the distribution of the *residual*, the boundary does vary. If you like, let $\theta=X\beta$ and set $F_\theta(y)=F(y-\theta)=F(y-X\beta)$, so that $F_\theta(0)=F(-X\beta)$: that exhibits my model as a special case of the more general one you just described. – whuber Mar 17 '15 at 20:48
  • @whuber very good point, and now that I think about it that must actually be the case with the Tobit regression as well (although I've never seen it brought up in an econometrics paper or textbook). – shadowtalker Mar 17 '15 at 20:50
  • @ssdecontrol I was leaning towards a gamma hurdle model, intuitively it makes the most sense for my problem. I'm not clear why you would use a Tobit model censored at 0 instead. Some information here: http://seananderson.ca/2014/05/18/gamma-hurdle.html – Robert Kubrick Mar 18 '15 at 13:35
  • @RobertKubrick I'm not sure either -- you know your data better than I do! But I wanted to at least clarify what the model does because there seemed to be some confusion about it in your question – shadowtalker Mar 18 '15 at 13:42
  • @ssdecontrol ok, I think it would be interesting to understand the mathematical differences between a censored 0 and gamma hurdle model. But that's the subject for a different, more complex question... – Robert Kubrick Mar 18 '15 at 14:24
  • @RobertKubrick that's a bit of what I got into with whuber. I can edit in some details about the distinction – shadowtalker Mar 18 '15 at 14:25
  • @ssdecontrol Yes, now that we've narrowed the focus to those two methodologies, it would be very interesting to understand the differences! Gamma hurdle vs. Tobit censored at 0, that is. – Robert Kubrick Mar 18 '15 at 14:27
  • Zero-inflation and hurdle models are nicely compared and contrasted at http://seananderson.ca/2014/05/18/gamma-hurdle.html. In brief, the hurdle models allow the zero responses to be modeled differently than the non-zero responses, thereby being more flexible but less parsimonious. I imagine the initial choice of a model would rely primarily on your concept of how the data are generated. – whuber Mar 18 '15 at 16:47
  • Although this answer keeps getting better and better, your description of censoring doesn't really fit the introductory characterization. It doesn't work by first determining whether $Y \gt 0$. Instead, the observations (for left censoring at zero) are simply the transformed variable $\max(0, Y^{*})$. It's unclear what "$l$" is or what role it plays, either. – whuber Mar 18 '15 at 18:51
  • @whuber that equation is riddled with typos. Also, I know this isn't the typical interpretation of censoring but I wanted to tie them together. The "$max$-transformed" version is very concise for this case but it doesn't generalize to multiple censoring bounds. – shadowtalker Mar 18 '15 at 19:01
  • But you're right that it isn't necessary. I fixed the equation and dropped that characterization entirely – shadowtalker Mar 18 '15 at 19:09
  • @ssdecontrol Added results from gamma hurdle model. – Robert Kubrick Mar 19 '15 at 13:11

Let me start by saying that applying OLS is entirely possible; many real-life applications do this. It sometimes causes the problem that you end up with fitted values less than 0, which I assume is what you are worried about. But if only very few fitted values are below 0, then I would not worry about it.

The Tobit model can (as you say) be used in the case of censored or truncated data. But it also applies directly to your case; in fact, the Tobit model was invented for exactly this case: $Y$ "piles up" at 0 and is otherwise roughly continuous. The thing to remember is that the Tobit model is difficult to interpret, and you would need to rely on the APE (average partial effect) and the PEA (partial effect at the average). See the comments below.
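
For concreteness, here is a sketch of how the APE could be computed from a Tobit fit (assuming the AER package and a fitted model tobit_fit; it uses the standard result that $\partial E(y \mid x) / \partial x_j = \beta_j \Phi(x\beta/\sigma)$):

# Average partial effects of the regressors on E(y | x) in a Tobit model.
# tobit_fit is assumed to come from AER::tobit(), which inherits from survreg.
tobit_ape = function(tobit_fit)
{
  beta  = coef(tobit_fit)                  # coefficients on the latent scale
  sigma = tobit_fit$scale                  # estimated sd of the latent error
  xb    = predict(tobit_fit, type = "lp")  # linear predictor x'beta
  adj   = mean(pnorm(xb / sigma))          # average of Phi(x'beta / sigma)
  adj * beta[-1]                           # APEs for the slope coefficients
}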

You could also apply the Poisson regression model, which has an almost OLS-like interpretation even though it is normally used with count data. Wooldridge (2012), Chapter 17, contains a very neat discussion of the subject.
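
A sketch of that alternative, applied to the specification from the question (the quasi-Poisson family lets glm() accept a continuous non-negative response without warnings; the coefficient estimates are the Poisson quasi-likelihood ones):

# Poisson-type regression of E(Y | X) = exp(X beta); fitted values are
# always non-negative, which avoids the OLS problem in the question
qpois = glm(Y ~ X1*log(X2) + X1*X3, data = d,
            family = quasipoisson(link = "log"))
summary(qpois)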

Repmat
  • "the tobit model has no real interpretation" could you explain this a bit more? I can't say I agree with that statement as-is. – shadowtalker Mar 17 '15 at 15:34
  • I added the negative estimate percentage to my question. It's 10% of the sample, quite high. What would be $c$, the censoring constant, in my case? There is no fixed censoring value after which $Y$ becomes 0. And what are APE and PEA? – Robert Kubrick Mar 17 '15 at 15:35
  • @ssdecontrol In the tobit model the $\beta_j$ measure the partial effects of the $x_j$ on $E(y^* \mid x)$, where $y^*$ is the (so-called) latent variable. The variable that OP would like to model is $y$, which is the observed outcome (hours worked, charitable contributions, etc.). This is why one should rely on the average partial effect (APE) and the partial effect at the average (PEA). You should not use the censored model. If your data were censored you would know, i.e. you have a question about income where the last answer is "I earn more than $x$"; that information can be "put into" the estimation -> censoring. – Repmat Mar 17 '15 at 15:45
  • @user3551644 sure, but I don't see how you can say that the model therefore has "no real interpretation" – shadowtalker Mar 17 '15 at 15:46
  • Hmm, okay, I will update my answer. – Repmat Mar 17 '15 at 15:49
  • The likelihood functions of censored and truncated models can differ radically with the same data. In what circumstances is your claim true that they are "almost identical"? – whuber Mar 17 '15 at 20:05