Testing for difference in group means with skewed data with many zeros

Question

Background

I am trying to perform a test for difference in means between two groups in a dataset with around 25k records, where 97% of the Y values are 0, and the non-zero Y values are heavily skewed, like so:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
2.00   25.00   50.00   85.05  100.00 3000.00

My dataset looks like this (I am checking for a difference in average amount between segments 1 and 2):

> head(march)
  Campaign Amount Donated  Segment 
1    March      0       0 Segment1 
2    March     50       1 Segment1 
3    March    100       1 Segment2

where Donated is a dummy variable indicating whether Amount is greater than 0.

Based on my reading (the papers and Cross Validated links listed below), my best option seems to be a zero-inflated or hurdle model. The best fit I got was with a hurdle-style two-part model.

The question

Both zero-inflated and hurdle models give me two distinct p-values: one for whether the two segments are equally likely to donate, and one for whether the average donation differs between the two segments, given that Amount is greater than 0.

Technically neither of these tells me whether the overall group means are statistically different. Can I back into a single test statistic from here? Or do I need to use a different approach?
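For concreteness, a single test on the overall means (zeros included) could be a permutation test on the difference in means, sketched below. It runs on simulated data rather than on march. The sample size, donation rate, and log-normal parameters are made up for illustration, and mean_diff is a hypothetical helper. Strictly speaking, a permutation test checks whether the two segments have the same distribution, so this is a sketch of one possible approach, not a definitive answer.

```r
# Simulated stand-in for the real data (all parameters are invented)
set.seed(1)
n <- 2000
sim <- data.frame(Segment = rep(c("Segment1", "Segment2"), each = n / 2))
donated <- rbinom(n, 1, 0.03)                       # about 97% zeros
sim$Amount <- donated * round(rlnorm(n, meanlog = 3.8, sdlog = 1))

# Difference in overall mean Amount (zeros included), Segment2 minus Segment1
mean_diff <- function(amount, segment) {
  mean(amount[segment == "Segment2"]) - mean(amount[segment == "Segment1"])
}

observed <- mean_diff(sim$Amount, sim$Segment)

# Null distribution: shuffle the segment labels and recompute the difference
n_perm <- 2000
perm <- replicate(n_perm, mean_diff(sim$Amount, sample(sim$Segment)))

# Two-sided p-value, with the +1 correction so it is never exactly 0
p_value <- (sum(abs(perm) >= abs(observed)) + 1) / (n_perm + 1)
p_value
```

On the real data, sim would be replaced by march and the number of permutations could be raised. This yields one p-value for the overall difference, but it does not use the two-part model at all.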

Current model

The best fit I have so far is:

> marchL <- glm(Donated ~ Segment, family = binomial(link = "logit"), data = march)
> marchD <- lm(log(Amount) ~ Segment, data = subset(march, Donated == 1))
> summary(marchL)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -3.57033    0.05404 -66.062   <2e-16 ***
Segment2        -0.19525    0.08023  -2.434   0.0149 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> summary(marchD)

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      3.79865    0.04682  81.141  < 2e-16 ***
Segment2         0.27405    0.06959   3.938 9.12e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

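So Segment 2 is less likely to donate (odds about 18% lower, since exp(-0.195) ≈ 0.82), but when it does donate, the typical donation is about 32% higher (exp(0.274) ≈ 1.32, because the second model is fit to log(Amount)). Neither result answers the question of whether the overall means differ.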
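To see what the two fits imply together, the coefficients can be back-transformed and combined, using overall mean = P(Amount > 0) × E[Amount | Amount > 0]. The sketch below copies the estimates from the two summaries above. It assumes log(Amount) has normal errors with the same residual variance in both segments, so that the exp(sigma^2 / 2) correction cancels in the ratio. That assumption is mine and is not checked here. The variable names are illustrative.

```r
# Coefficients copied from summary(marchL) and summary(marchD) above
b_logit <- c(intercept = -3.57033, segment2 = -0.19525)
b_log   <- c(intercept =  3.79865, segment2 =  0.27405)

# Probability of donating in each segment (inverse logit)
p_seg1 <- plogis(b_logit[["intercept"]])
p_seg2 <- plogis(b_logit[["intercept"]] + b_logit[["segment2"]])

# Ratio of donation size given a donation, Segment2 / Segment1.
# Assumes equal residual variance on the log scale in both segments.
ratio_given_donation <- exp(b_log[["segment2"]])

# Implied ratio of overall means (zeros included), Segment2 / Segment1
ratio_overall <- (p_seg2 / p_seg1) * ratio_given_donation

c(p_seg1 = p_seg1, p_seg2 = p_seg2,
  ratio_given_donation = ratio_given_donation,
  ratio_overall = ratio_overall)
```

Under those assumptions the implied overall mean for Segment 2 is roughly 9% higher than for Segment 1, but this is a point estimate only. It carries no standard error, so on its own it is not a test. A confidence interval would need something like the delta method or a bootstrap over both models.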

What I consulted already

What is the difference between zero-inflated and hurdle distributions (models)?
https://stats.stackexchange.com/a/111626
Regression Models for Count Data in R (A Zeileis)
Modelling skewed data with many zeros: A simple approach combining ordinary and logistic regression (D FLETCHER 2005)
Comparing species abundance models (JM Potts 2006)

These all deal with modeling the data, but not with testing for a difference in means.

Thanks in advance for any help.
