Calculating confidence intervals in two sample analysis with extremely skewed count data

Question

I want to identify the effect of a feature on a number of events. Each observation has >= 0 events and is assigned to one group, A or B. Each observation was assigned to a group by random sampling, and observations in group A were exposed to a treatment that group B was not exposed to.

I have a very large number of observations (> 300K). Here is an idea of what my table looks like:

Obs   Nb_events  Group
1     0          A
1     0          B
1     0          B
1     2          A
1     0          A
...

The overall probability of having at least one event is 0.04. If the observation is in group A, it's 0.00442, and if it's in group B, it's 0.00436.

The mean for group A is 0.01 and for group B it is 0.012.

The standard deviation of the number of events is 0.292 for group A and 0.234 for group B.

I want to estimate, as simply as possible, a confidence interval for the effect of being in group B on the number of events and on the probability of having at least one event. I have read that my data seem to follow a negative binomial model, or an overdispersed Poisson model, because the sample variance is greater than the sample mean.
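To make the overdispersion claim concrete, here is a minimal sketch that computes the variance-to-mean ratio from the summary statistics reported above. A Poisson distribution would give a ratio of 1; the function name `dispersion_ratio` is only an illustrative helper.

```python
# Reported summary statistics from the question: (mean, standard deviation).
groups = {"A": (0.01, 0.292), "B": (0.012, 0.234)}

def dispersion_ratio(mean, sd):
    """Variance-to-mean ratio; equals 1 for a Poisson distribution."""
    return sd ** 2 / mean

ratios = {name: dispersion_ratio(m, s) for name, (m, s) in groups.items()}
for name, ratio in ratios.items():
    print(f"group {name}: variance/mean = {ratio:.2f}")
```

Ratios well above 1 in both groups are consistent with the variance exceeding the mean, as described.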

However, I don't know how to find a confidence interval given this. It seems like I can't use a normal approximation, and a Poisson approximation does not seem to fit my data either. Given this, how can I estimate a standard 95% confidence interval as simply as possible for my data?
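One simple candidate, sketched here under assumptions rather than offered as a settled answer, is a large-sample (Welch-type) interval for the difference in mean counts. It uses the empirical standard deviations, so it does not assume Poisson variance, but it does rely on the sample means being approximately normal, which would need checking for data this skewed. The group sizes are not given above, so `n_a` and `n_b` are hypothetical.

```python
import math
from statistics import NormalDist

def mean_difference_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Large-sample CI for mean_b - mean_a computed from summary statistics."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    # Standard error of the difference, using each group's own variance.
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    diff = mean_b - mean_a
    return diff - z * se, diff + z * se

# Means and standard deviations are the reported values;
# the group sizes are NOT stated in the question and are assumed here.
n_a = n_b = 150_000  # hypothetical
low, high = mean_difference_ci(0.01, 0.292, n_a, 0.012, 0.234, n_b)
print(f"95% CI for mean(B) - mean(A): ({low:.5f}, {high:.5f})")
```

A bootstrap over the raw observations would be a natural cross-check on whether the normal approximation is adequate here.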

I also want to find a confidence interval for the probability of having at least one event. I wanted to use the usual confidence interval based on normal approximations, for two samples.

But this also relies on a normal approximation via the Central Limit Theorem, which does not seem justified because my p is extremely low.
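An alternative that is often suggested for small proportions is the Wilson score interval, combined across two samples with Newcombe's hybrid method. The sketch below is a minimal illustration; the event counts and group sizes are hypothetical values chosen to roughly match the proportions above (0.00442 and 0.00436), since the actual counts are not stated.

```python
import math
from statistics import NormalDist

def wilson_interval(x, n, level=0.95):
    """Wilson score interval for a single binomial proportion x / n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = x / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half

def newcombe_difference_ci(x_a, n_a, x_b, n_b, level=0.95):
    """Newcombe hybrid score interval for p_a - p_b."""
    p_a, p_b = x_a / n_a, x_b / n_b
    l_a, u_a = wilson_interval(x_a, n_a, level)
    l_b, u_b = wilson_interval(x_b, n_b, level)
    diff = p_a - p_b
    lower = diff - math.sqrt((p_a - l_a) ** 2 + (u_b - p_b) ** 2)
    upper = diff + math.sqrt((u_a - p_a) ** 2 + (p_b - l_b) ** 2)
    return lower, upper

# Hypothetical counts: 663 / 150000 = 0.00442 and 654 / 150000 = 0.00436.
x_a, n_a, x_b, n_b = 663, 150_000, 654, 150_000
ci_a = wilson_interval(x_a, n_a)
ci_diff = newcombe_difference_ci(x_a, n_a, x_b, n_b)
print("Wilson CI for p_A:", ci_a)
print("Newcombe CI for p_A - p_B:", ci_diff)
```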

Edit: I haven't found an answer to this yet besides using an exact test, which is, I believe, computationally intensive given my sample sizes. Instead of finding a confidence interval, I think simply testing whether the difference between my proportions is significant would be appropriate. I believe a chi-square goodness-of-fit test would work well here, based on [this question][1]. However, I think it would not be applicable to my count data. Can you confirm this?
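For the dichotomized outcome (at least one event versus none), the test I have in mind can be sketched as a Pearson chi-square test on the 2x2 table of group by outcome. Strictly speaking this is a test of homogeneity rather than goodness of fit. The counts below are hypothetical, chosen to approximate the proportions above, and the function name is illustrative only.

```python
import math

def chi_square_2x2(x_a, n_a, x_b, n_b):
    """Pearson chi-square test (1 df, no continuity correction) comparing
    the proportion of observations with >= 1 event in two groups."""
    observed = [[x_a, n_a - x_a], [x_b, n_b - x_b]]
    total = n_a + n_b
    row_totals = [n_a, n_b]
    col_totals = [x_a + x_b, total - x_a - x_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: 663 of 150000 in group A, 654 of 150000 in group B.
stat, p_value = chi_square_2x2(663, 150_000, 654, 150_000)
print(f"chi-square = {stat:.4f}, p-value = {p_value:.3f}")
```

This only addresses the probability of at least one event; it says nothing about the number of events per observation.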

Confidence intervals are intervals for unknowns (parameters or functions of parameters), not for data. If you want an interval for *data* you'll have to be clearer about what kind of interval you mean (tolerance interval? prediction interval? something else?). If you *do* mean a confidence interval, don't describe it as an interval for data, instead make clear what quantity you're producing an interval for. — Glen_b, Apr 04 '17 at 12:36
You're right, I wanted to make a simplified version of my problem but it made it unclear. I'm going to edit my question. — Konrad, Apr 04 '17 at 13:09
I've edited my comment. The issue is a bit more complex now but it clearly describes my problem and it should be a lot clearer. I hope! — Konrad, Apr 04 '17 at 13:24
What makes you say that "a Poisson approximation does not seem to fit my data either"? On what basis did you draw that conclusion? And given that in the previous paragraph you explicitly state your data seems to follow a negative binomial or overdispersed Poisson, why not use one of those models for calculating your confidence interval? — Ryan Simmons, Apr 04 '17 at 14:03
Alternatively, if your only interest is calculating the probability of at least one event, you can simply dichotomize the outcome and use a binomial confidence interval. — Ryan Simmons, Apr 04 '17 at 14:10
I said that a Poisson approximation did not seem like it would be a good fit because Poisson assumes mean and variance are equal, I believe, which isn't the case here. — Konrad, Apr 04 '17 at 14:38
I do want to use those models to calculate my confidence interval, but I don't have a strong background in statistics and I can't find a clear resource that explains how to do this simply. Yes, a binomial confidence interval is what I want for the probability of having >= 1 event, but usually resources talk about normal approximations or poisson approximations and my data does not seem to follow either. — Konrad, Apr 04 '17 at 14:41
