Background
I am trying to perform a test for difference in means between two groups in a dataset with around 25k records, where 97% of the Y values are 0, and the non-zero Y values are heavily skewed, like so:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   2.00   25.00   50.00   85.05  100.00 3000.00
My dataset looks like this (I am checking for a difference in average amount between segments 1 and 2):
> head(march)
  Campaign Amount Donated  Segment
1    March      0       0 Segment1
2    March     50       1 Segment1
3    March    100       1 Segment2
where Donated is a dummy variable indicating whether Amount is greater than 0.
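For concreteness, a minimal sketch of how such an indicator can be derived. The data frame `march_head` is a hypothetical name and only re-enters the three rows printed above; it is not the full dataset.

```r
# The three rows shown by head(march), re-entered by hand for illustration
march_head <- data.frame(
  Campaign = "March",
  Amount   = c(0, 50, 100),
  Segment  = c("Segment1", "Segment1", "Segment2")
)

# Donated is 1 when Amount is greater than 0, and 0 otherwise
march_head$Donated <- as.integer(march_head$Amount > 0)
print(march_head)
```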
Based on my reading (the papers and Cross Validated links listed below), my best option seems to be a zero-inflated or hurdle model; the best fit I got was with a hurdle model.
The question
Both zero-inflated and hurdle models give me two distinct p-values: one for whether the two segments are equally likely to donate, and one for whether the average donation differs between the two groups, given that Amount is greater than 0.
Technically neither of these tells me whether the overall group means are statistically different. Can I back into a single test statistic from here? Or do I need to use a different approach?
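To make the target quantity concrete: the overall mean factors as E[Amount] = P(Amount > 0) * E[Amount | Amount > 0], so it depends on both parts of the model at once. The sketch below checks this identity on simulated data (the data frame `sim` and all its parameters are assumptions, not my real data) and attaches a percentile bootstrap interval to the difference in overall means. The bootstrap is only one candidate approach, shown to illustrate what a single answer would look like, not an established solution.

```r
set.seed(1)

# Simulated stand-in (NOT the real data): about 3% donors,
# log-normal amounts among donors, two segments of equal size.
n <- 5000
sim <- data.frame(Segment = rep(c("Segment1", "Segment2"), each = n))
sim$Donated <- rbinom(2 * n, 1, 0.03)
sim$Amount  <- ifelse(sim$Donated == 1,
                      rlnorm(2 * n, meanlog = 3.8, sdlog = 1), 0)

# Two-part decomposition of the overall mean per segment
overall  <- tapply(sim$Amount, sim$Segment, mean)
p_donate <- tapply(sim$Donated, sim$Segment, mean)
donors   <- sim$Donated == 1
m_donors <- tapply(sim$Amount[donors], sim$Segment[donors], mean)
stopifnot(isTRUE(all.equal(as.numeric(overall),
                           as.numeric(p_donate * m_donors))))

# Difference in overall means (Segment2 - Segment1), resampling
# rows with replacement within each segment
boot_diff <- function(d) {
  idx <- unlist(lapply(split(seq_len(nrow(d)), d$Segment),
                       function(i) sample(i, replace = TRUE)))
  m <- tapply(d$Amount[idx], d$Segment[idx], mean)
  unname(m["Segment2"] - m["Segment1"])
}
diffs <- replicate(1000, boot_diff(sim))
ci <- quantile(diffs, c(0.025, 0.975))  # percentile interval
print(ci)
```

If the interval excludes 0, that would suggest a difference in overall means, but whether a percentile bootstrap behaves well with 97% zeros and a heavy right tail is exactly the kind of thing I am unsure about.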
Current model
The best fit I have so far is:
> marchL <- glm(Donated ~ Segment, family = binomial(link = "logit"), data = march)
> marchD <- lm(log(Amount) ~ Segment, data = subset(march, Donated == 1))
> summary(marchL)
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.57033    0.05404 -66.062   <2e-16 ***
Segment2    -0.19525    0.08023  -2.434   0.0149 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> summary(marchD)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.79865    0.04682  81.141  < 2e-16 ***
Segment2     0.27405    0.06959   3.938 9.12e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
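So Segment2 is less likely to donate (odds ratio exp(-0.195) ≈ 0.82), but when it does, the donation is about 32% higher on the geometric-mean scale (exp(0.274) ≈ 1.32; the coefficient 0.274 is a difference in log(Amount), not a percentage). Neither result actually answers the question about the overall means.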
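To read the two coefficient tables on their natural scales, the estimates can be back-transformed. This sketch uses only the numbers printed in the summaries above; the variable names are illustrative.

```r
# Coefficients copied from summary(marchL) and summary(marchD)
b0_logit <- -3.57033; b1_logit <- -0.19525   # logistic part
b0_log   <-  3.79865; b1_log   <-  0.27405   # log-amount part

# Logistic part: probability of donating in each segment
p1 <- plogis(b0_logit)              # Segment1, roughly 0.027
p2 <- plogis(b0_logit + b1_logit)   # Segment2, roughly 0.023
odds_ratio <- exp(b1_logit)         # roughly 0.82

# Log-amount part: geometric means among donors and their ratio
gm1 <- exp(b0_log)
gm2 <- exp(b0_log + b1_log)
ratio_gm <- exp(b1_log)             # roughly 1.32

print(c(p1 = p1, p2 = p2, odds_ratio = odds_ratio, ratio_gm = ratio_gm))
```

Note that `p * gm` is not the overall arithmetic mean: exp() of a mean of logs is a geometric mean, and getting back to E[Amount | Amount > 0] needs a retransformation correction (for example exp(sigma^2 / 2) under log-normality), which cannot be computed from the output shown here.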
What I consulted already
- What is the difference between zero-inflated and hurdle distributions (models)?
- https://stats.stackexchange.com/a/111626
- Regression Models for Count Data in R (A. Zeileis)
- Modelling skewed data with many zeros: A simple approach combining ordinary and logistic regression (D. Fletcher, 2005)
- Comparing species abundance models (J. M. Potts, 2006)
These all deal with modeling the data, but not with testing for a difference in means.
Thanks in advance for any help.