
I am looking at a data set that has four groups. In each group, the data is mostly (99+% of the time) composed of zeros, but when it is not zero it can be any float (e.g., 0.01 to 921.2, with most values under 10). Once I examine dataset 1, I want to examine other datasets that also have 4 groups and similar sparseness in the data. The number of observations n in a group can be as low as 10 or as high as, say, 20,000.

I want to calculate a point estimate and confidence intervals (CI) around that estimate for each group so that I can quickly determine whether group 1 is, say, worse than group 2.

My question: is it appropriate to calculate the CI using mean and standard error (stdev / sqrt(n) ) with such a sparse data set? Any advice would be appreciated!!
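
For concreteness, here is roughly the calculation I have in mind, as a sketch (the function and variable names are my own, and the data below is simulated, not my real data):

```python
import numpy as np

def naive_ci(values, z=1.96):
    """Point estimate (mean) and a normal-theory CI: mean +/- z * stdev / sqrt(n)."""
    values = np.asarray(values, dtype=float)
    n = values.size
    mean = values.mean()
    se = values.std(ddof=1) / np.sqrt(n)  # stdev / sqrt(n)
    return mean, mean - z * se, mean + z * se

# One simulated sparse group: mostly zeros, a handful of positive floats.
rng = np.random.default_rng(0)
group1 = np.zeros(5000)
group1[rng.choice(5000, size=25, replace=False)] = rng.gamma(1.0, 5.0, size=25)
print(naive_ci(group1))  # (point estimate, lower, upper)
```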

captain_ahab

1 Answer


I want to calculate a point estimate and confidence intervals (CI) around that estimate for each group so that I can quickly determine whether group 1 is, say, worse than group 2.

A point estimate isn't necessarily a problem; you can estimate a mean by a mean, though the extreme skewness may be an issue (e.g., a mean may not be representative of either the bulk of zeros or the mean of the non-zero data).

You might consider modelling it as a Bernoulli 0/not-0 and then find a model for the not-0 cases.
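
For example, a minimal sketch of that two-part (hurdle-style) decomposition (the function below and its names are illustrative, not from any particular package): estimate the probability of a non-zero value as a Bernoulli proportion, summarize the non-zero values separately, and note that the overall mean is the product of the two pieces.

```python
import numpy as np

def two_part_summary(values):
    """Split a sparse sample into a 0/non-zero part and a non-zero-magnitude part.

    Returns the estimated probability of a non-zero observation (Bernoulli part),
    the mean of the non-zero values (magnitude part), and their product, which
    equals the overall mean.
    """
    values = np.asarray(values, dtype=float)
    nonzero = values[values != 0]
    p_nonzero = nonzero.size / values.size
    mean_nonzero = nonzero.mean() if nonzero.size else 0.0
    return p_nonzero, mean_nonzero, p_nonzero * mean_nonzero

# Example with simulated data: about 1% non-zero, skewed positive values.
rng = np.random.default_rng(0)
x = np.where(rng.random(2000) < 0.01, rng.gamma(1.0, 5.0, 2000), 0.0)
print(two_part_summary(x))
```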

My question: is it appropriate to calculate the CI using mean and standard error (stdev / sqrt(n) ) with such a sparse data set?

The $s/\sqrt{n}$ formula is still a standard error, but a multiple of it may not be much help as an interval for the mean.

In really large samples (large enough to have, say, thousands of non-zero observations), that might be a useful approach, but since the sample size can be small this may have some issues as well.
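
To illustrate (with simulated data, so the numbers are assumptions rather than anything about your data set): the simulation below checks how often a mean $\pm 1.96 \times \text{se}$ interval actually covers the true mean when 99% of observations are zero and the non-zero part is skewed; coverage falls well short of the nominal 95% when $n$ is small.

```python
import numpy as np

rng = np.random.default_rng(1)

def coverage(n, p_nonzero=0.01, reps=5000, z=1.96):
    """Empirical coverage of mean +/- z*se for a 99%-zeros, skewed mixture."""
    true_mean = p_nonzero * 5.0  # non-zero part ~ Gamma(shape=1, scale=5), mean 5
    hits = 0
    for _ in range(reps):
        # Each observation is non-zero with probability p_nonzero.
        x = np.where(rng.random(n) < p_nonzero, rng.gamma(1.0, 5.0, n), 0.0)
        se = x.std(ddof=1) / np.sqrt(n)
        m = x.mean()
        hits += (m - z * se) <= true_mean <= (m + z * se)
    return hits / reps

for n in (30, 300, 3000):
    print(n, coverage(n))  # nominal 0.95; actual coverage is typically much lower for small n
```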

Glen_b
  • I like the idea of modeling the non-zeros separately! So, to be clear, you think that using the st. error for an interval for the mean is problematic? Do you have thoughts on an alternative? Thanks for your help! – captain_ahab Dec 09 '14 at 00:31
  • Please describe *HOW* you will use the standard error for a mean to obtain a CI, and maybe I can tell you whether or not it's problematic. I do plan to come back with some additional information but it will take a while to generate it all. – Glen_b Dec 09 '14 at 00:32
  • Thanks Glen. I was simply planning on finding the CI around the mean by adding/subtracting 1.96*st.error (at least in cases where n is large enough). My only concern is that the data is extremely skewed and sparse. – captain_ahab Dec 09 '14 at 03:06
  • Thanks for clarifying. How do you decide when n is large enough? – Glen_b Dec 09 '14 at 03:54
  • Based on the CLT, a heuristic of at least 30 – captain_ahab Dec 09 '14 at 19:40
  • What does the CLT actually say? You might like to read [this](http://stats.stackexchange.com/questions/61798/example-of-distribution-where-large-sample-size-is-necessary-for-central-limit-t/61849#61849), which contains an example where n=1000 is only just about enough (and gives a means to find examples where n=1000 wouldn't be nearly enough). So highly skew distributions are one example where n=30 or in some situations even n=3000 might not be nearly enough. – Glen_b Dec 09 '14 at 20:25
  • nice link and point! okay, so the answer to my question ('is it appropriate to calculate the CI using mean and standard error (stdev / sqrt(n) ) with such a sparse data set?') is that it is probably only sensible if n is sufficiently large (probably being a rather large number). And the trick is to figure out when the n is large enough in my particular case. – captain_ahab Dec 09 '14 at 20:36
  • Yes - at least if you're using the $\pm 1.96 \times \text{se}$ thing - that's what I was trying to convey. You may want to hold off on awarding your tick (you can easily click it again to unaward it and then award it later if you wish) until you get some more concrete advice about what you might do instead, from me or someone else. – Glen_b Dec 09 '14 at 21:40