
I am trying to run a model to estimate how catastrophic illnesses such as TB and AIDS affect spending on hospitalization. The dependent variable is per-hospitalization cost; the independent variables are various individual markers, almost all of which are dummies (gender, head-of-household status, poverty status, and of course a dummy for whether you have the illness), plus age, age squared, and a number of interaction terms.

As is to be expected, a significant amount of the data -- and I mean a lot -- is piled up at zero (i.e., no expenditure on hospitalization in the 12-month reference period). What would be the best way to deal with data like these?

For now I have decided to transform the cost to ln(1 + cost) so as to keep all observations, and then run a linear model. Am I on the right track?

user42372

3 Answers


As discussed elsewhere on the site, ordinal regression (e.g., proportional odds, proportional hazards, probit) is a flexible and robust approach. Discontinuities are allowed in the distribution of $Y$, including extreme clumping. Nothing is assumed about the distribution of $Y$ for a single $X$. Zero-inflated models make far more assumptions than semi-parametric models. For a full case study, see Chapter 15 of my course handouts at http://hbiostat.org/rms.

One great advantage of ordinal models for continuous $Y$ is that you don't need to know how to transform $Y$ before the analysis.

Frank Harrell

Clumping at 0 is called "zero inflation". By far the most common cases are count models, leading to zero-inflated Poisson and zero-inflated negative binomial regression. However, there are also ways to model zero inflation with real positive values (e.g., a zero-inflated gamma model).

See Min and Agresti (2002), "Modeling nonnegative data with clumping at zeros", for a review of these methods.

Peter Flom

The suggestion of using a zero-inflated Poisson model is an interesting start. One benefit is that it jointly models the probability of having any illness-related costs and the process determining what those costs turn out to be should you have any. Its limitation is that it imposes a fairly strict structure on the shape of the outcome conditional on having accrued any costs (e.g., a specific mean-variance relationship and a positive-integer outcome, the latter of which can be relaxed for some modeling purposes).

If you are okay with treating the illness-related admission process and the illness-related costs-given-admission process independently, you can extend this by first modeling the binary outcome: did you accrue any costs related to illness, yes or no? This is a simple logistic regression and lets you evaluate risk factors and prevalence. Given that, you can restrict the analysis to the subset of individuals who accrued any costs and model the actual cost process with a host of modeling techniques. Poisson is good; quasi-Poisson would be better (it accounts for small unmeasured sources of covariation in the data and departures from model assumptions). But the sky's the limit for modeling the continuous cost process.

If you absolutely need to model the correlation between the parameters of the two processes, you can use bootstrap SE estimates. I see no reason why this would be invalid, but I would be curious to hear others' input if it might be wrong. In general, I think these are two separate questions and should be treated as such to obtain valid inference.

AdamO