
I'm modeling the ratio of total medical cost in a 12-month period to total medical cost in the preceding 12-month period (total_medical_cost_ratio = post_total_medical_cost / pre_total_medical_cost).

The data are skewed and heavy-tailed, and 1.1% of the records have a cost of $0 in the pre period.

What is the best way to represent the ratio of two values when the range of values in the denominator includes zero?

I'd prefer not to exclude these records, since that would throw away valuable data. I've also tried substituting 0.01 for 0 in the denominator, but this increases the skewness and kurtosis of the data more than I'm comfortable with (the mean post/pre ratio inflates from 4.4 to 18,000!).
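To illustrate why the substituted constant matters so much, here is a minimal sketch on made-up data. The lognormal parameters, the sample size, and the helper name `mean_ratio_with_floor` are assumptions for illustration only and are not taken from the actual cost data; only the 1.1% share of zero pre-period costs comes from the description above.

```python
import random
import statistics


def mean_ratio_with_floor(pre, post, floor):
    """Mean of post/pre after replacing zero pre-period costs with `floor`."""
    return statistics.fmean(
        b / (a if a > 0 else floor) for a, b in zip(pre, post)
    )


rng = random.Random(0)
n = 10_000

# Hypothetical costs: lognormal, with about 1.1% of pre-period costs set to zero.
pre = [0.0 if rng.random() < 0.011 else rng.lognormvariate(7, 1.5) for _ in range(n)]
post = [rng.lognormvariate(7, 1.5) for _ in range(n)]

# The mean ratio depends heavily on the arbitrary constant used in place of zero.
for floor in (0.01, 1.0, 100.0):
    print(floor, round(mean_ratio_with_floor(pre, post, floor), 1))
```

Every record with a zero denominator contributes post/floor to the mean, so the smaller the substituted constant, the more those few records dominate the result.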

RobertF
  • Why are you taking a ratio involving zero in the denominator? – Dave Jan 20 '22 at 15:37
  • @Dave In some records there is zero cost in period 1 but period 2 cost > 0. – RobertF Jan 20 '22 at 15:45
  • What would a ratio mean in that situation? – Dave Jan 20 '22 at 15:45
  • @Dave A very large increase in cost (infinite magnitude). But is there a more reasonable way to model this huge increase without getting into infinities? Excluding these records would bias the model. – RobertF Jan 20 '22 at 15:49
  • @Dave I could be approaching this the wrong way - maybe better to simply use a regression model: Period 2 cost = Intercept + Beta*(Period 1 cost)? But then we may get negative predicted values in Period 2. – RobertF Jan 20 '22 at 15:59
  • Threads like https://stats.stackexchange.com/questions/30728 are directly relevant because they address the same difficulty with zeros and many of the proposed solutions directly apply. There are several red flags in the present instance: there is something fundamentally problematic with analyzing ratios when there's a good chance the denominator will be zero. Think hard about the statistical problem you're trying to solve, because therein may lie some clues to better ways of expressing the data and the model. – whuber Jan 20 '22 at 17:17
  • @whuber Cool, thanks this is helpful. To be more clear, the purpose of our model is to create a data generating process, from which we can generate "artificial" data and calculate an expected value. We're comparing different causal inference techniques (propensity score matching, g-computation, etc.) so we need to compare estimated average treatment effects to a known "true" average treatment effect from a DGP. – RobertF Jan 21 '22 at 20:10
  • In that case, why represent the ratio at all? Just generate the pairs of costs. – whuber Jan 21 '22 at 20:11
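
One way to read the suggestion of generating the pairs of costs directly is sketched below. This is not the model from the question: the zero-probability-plus-lognormal structure, the shared person-level term, the multiplicative treatment effect, every parameter value, and the function name `simulate_unit` are hypothetical choices made only for illustration. Because both potential outcomes are generated for each unit, the "true" average treatment effect is known without ever forming a ratio.

```python
import math
import random
import statistics


def simulate_unit(rng, p_zero=0.011, mu=7.0, sd_person=1.0,
                  sd_period=0.8, log_effect=-0.2):
    """Draw (pre, post_untreated, post_treated) costs for one person.

    All parameter values are hypothetical placeholders.
    """
    # A shared person-level term makes pre and post costs correlated.
    person = rng.gauss(0.0, sd_person)

    def draw_cost():
        # With probability p_zero the period has no cost at all.
        if rng.random() < p_zero:
            return 0.0
        return math.exp(mu + person + rng.gauss(0.0, sd_period))

    pre = draw_cost()
    post_untreated = draw_cost()
    # Assumed multiplicative treatment effect on the post-period cost.
    post_treated = post_untreated * math.exp(log_effect)
    return pre, post_untreated, post_treated


rng = random.Random(1)
units = [simulate_unit(rng) for _ in range(20_000)]

# Average treatment effect on post-period cost in this simulated sample.
sample_ate = statistics.fmean(t - u for _, u, t in units)
share_zero_pre = sum(pre == 0.0 for pre, _, _ in units) / len(units)
print(round(sample_ate, 2), round(share_zero_pre, 4))
```

Zero pre-period costs stay in the simulated data as ordinary values, so no record has to be dropped or altered, and an estimator such as propensity score matching can be scored against `sample_ate` on the cost scale.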
