I have a set of 9000 outcomes and I wish to estimate (with a 95% confidence interval) the number of 'positive' examples. After viewing 413 examples I found 13 positives (a positive rate of 13/413 = 0.0315). With a cumulative binomial distribution I can estimate the total number of positives with a 95% confidence interval.

However, the calculated positive rate is based on a small sample, and adding a single further positive example can shift the 95% confidence interval by a large amount. How can I incorporate the size of my sample into the estimate so that it is more robust?

My MATLAB code:

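N_found = 14;    % positives found so far
N_excl  = 400;   % reviewed examples that were not positive
N_tot   = 9000;  % total number of outcomes
% 2.5% and 97.5% quantiles (a 95% interval) for positives among the unreviewed outcomes, plus those already found
binoinv([0.025 0.975], N_tot-N_found-N_excl, N_found/(N_found+N_excl)) + N_found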
  • If you go from 13 positive to 14 positive cases, the interval *should* shift substantially -- the point estimate will increase by a multiple of nearly 14/13 (i.e. up by about 7.5%) – Glen_b Feb 07 '17 at 23:32

1 Answer

To get a confidence interval, 95% or otherwise, for a binomial parameter p, you can use the Clopper-Pearson method, which is exact. See Hahn and Meeker, Statistical Intervals: A Guide for Practitioners, first edition, Wiley, 1991.
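As a rough sketch of what that could look like in MATLAB, the Clopper-Pearson bounds can be written as beta quantiles. This assumes the Statistics and Machine Learning Toolbox (for `betainv`) and uses the counts from the question. The variable names `N_seen`, `N_rest`, `total_lo` and `total_hi` are illustrative only, and the last step (applying the interval for p to the unreviewed outcomes and adding the positives already found) is a simple heuristic, not part of the Clopper-Pearson method itself.

```matlab
% Sketch only: Clopper-Pearson interval for p, then a rough range for the total.
N_found = 13;     % positives seen so far (from the question)
N_seen  = 413;    % examples reviewed so far
N_tot   = 9000;   % total number of outcomes
alpha   = 0.05;   % 1 - confidence level

% Clopper-Pearson bounds expressed as beta quantiles (requires N_found >= 1)
p_lo = betainv(alpha/2,     N_found,     N_seen - N_found + 1);
p_hi = betainv(1 - alpha/2, N_found + 1, N_seen - N_found);

% Heuristic (an assumption, not from the answer): apply the bounds on p
% to the unreviewed outcomes and add the positives already found.
N_rest   = N_tot - N_seen;
total_lo = N_found + floor(p_lo * N_rest);
total_hi = N_found + ceil(p_hi * N_rest);

fprintf('p in [%.4f, %.4f], total positives roughly in [%d, %d]\n', ...
        p_lo, p_hi, total_lo, total_hi);
```

Because the interval depends on `N_seen`, it narrows as more examples are reviewed, and the lower bound for the total can never fall below the number of positives already found.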

Michael R. Chernick
  • Are you suggesting "exact" means "robust"? – whuber Feb 07 '17 at 17:57
  • @whuber No, exact means that it has the exact confidence level specified. – Michael R. Chernick Feb 07 '17 at 18:01
  • Then how does this respond to the question? It asks for "robust" solutions and it makes it clear what it means by "robust." – whuber Feb 07 '17 at 18:11
  • Should I use the lower and upper proportion values in my estimate and run it twice? And then take the upper and lower bounds of both answers? – Héctor van den Boorn Feb 07 '17 at 18:35
  • For a two-sided confidence interval the upper and lower bounds that you get initially are the appropriate ones. You do not need to do this twice. Also you asked for a robust solution. The solution gives you exactly 95% coverage and there is no need for robustness. – Michael R. Chernick Feb 07 '17 at 18:41
  • This doesn't work, since the 95% interval does not take into account how many instances are left. However, by running my original algorithm twice (once with the lower and once with the upper estimate of p) I do get a relatively stable estimate which converges as N increases. The problem with using only the values of p is that at the end it estimates, e.g., 240 positives as the lower bound while I have already found 260; by double-running it, the lower bound of the total positives is set at 262, for example. – Héctor van den Boorn Feb 07 '17 at 19:39
  • One almost never gets "exactly" the nominal coverage, Michael, because the distribution is discrete. Intervals for rare proportions are particularly problematic in this regard. I checked Hahn & Meeker (first edition, pp. 103-108) and could not find any claim there that this is an exact procedure. – whuber Feb 10 '17 at 18:54