
I am trying to run a model to estimate how catastrophic illnesses such as TB and AIDS affect spending on hospitalization. The dependent variable is per-hospitalization cost; the independent variables are various individual markers, almost all of which are dummies (gender, head-of-household status, poverty status, and of course a dummy for whether you have the illness), plus age, age squared, and a number of interaction terms.

As is to be expected, a significant amount of the data -- and I mean a lot -- is piled up at zero (i.e., no expenditure on hospitalization in the 12-month reference period). What would be the best way to deal with data like these?

For now I have decided to transform the cost to ln(1 + cost) so as to keep all observations, and then run a linear model. Am I on the right track?

user42372

3 Answers


As discussed elsewhere on the site, ordinal regression (e.g., proportional odds, proportional hazards, probit) is a flexible and robust approach. Discontinuities are allowed in the distribution of $Y$, including extreme clumping. Nothing is assumed about the distribution of $Y$ for a single $X$. Zero-inflated models make far more assumptions than semi-parametric models. For a full case study, see Chapter 15 of my course handouts at http://hbiostat.org/rms.

One great advantage of ordinal models for continuous $Y$ is that you don't need to know how to transform $Y$ before the analysis.

Frank Harrell

Clumping at 0 is called "zero inflation". By far the most common cases are count models, leading to zero-inflated Poisson and zero-inflated negative binomial regression. However, there are also ways to model zero inflation with real positive values (e.g., a zero-inflated gamma model).

See Min and Agresti (2002), "Modeling nonnegative data with clumping at zeros", for a review of these methods.

Peter Flom

The suggestion of using a zero-inflated Poisson model is an interesting start. One benefit is that it jointly models the probability of having any illness-related costs and the process determining what those costs turn out to be should you have any. Its limitation is that it imposes a fairly strict structure on the shape of the outcome conditional on having accrued any costs (e.g., a specific mean-variance relationship and a positive-integer outcome, the latter of which can be relaxed for some modeling purposes).

If you are okay with treating the illness-related admission process and the illness-related costs-given-admission process independently, you can extend this by first modeling the binary outcome: did you accrue any costs related to illness, yes or no? This is a simple logistic regression and lets you evaluate risk factors and prevalence. Given that, you can restrict the analysis to the subset of individuals who accrued any costs and model the actual cost process with a host of modeling techniques. Poisson is good; quasi-Poisson would be better (it accounts for small unmeasured sources of covariation in the data and departures from model assumptions). But the sky's the limit for modeling the continuous cost process.

If you absolutely need to model the correlation between the parameters of the two processes, you can use bootstrap SE estimates. I see no reason why this would be invalid, but I would be curious to hear others' input if it might be wrong. In general, I think these are two separate questions and should be treated as such to obtain valid inference.

AdamO