
I have a large population of data instances (say, 1000 instances) that are either of class1 or of class2. I would like to obtain a confidence interval for how many instances are of class1 without exhaustively checking all instances. I have randomly sampled 50 instances, and all 50 were of class1. I used the rule of three to determine that a 95% confidence interval for the proportion of instances that are of class1 is [0.94, 1].

From my sampling, I know that at least 50 instances are of class1. For the remaining 1000 – 50 = 950 instances whose classes are unknown, I assume I can apply the [0.94, 1] confidence interval found above. Therefore, can I conclude that, with 95% confidence, there are at least 50 + (1000 – 50)(0.94) = 943 instances from the population of 1000 that are of class1?

If this conclusion isn’t statistically sound, how can I obtain a confidence interval for class1?

thatWiseGuy
  • 1
    No, even if this were an exact confidence interval, that is an incorrect interpretation. – Michael R. Chernick Apr 20 '17 at 19:05
  • What the confidence interval does say is that in repeated sampling 95% of the intervals generated would include the true binomial success parameter. – Michael R. Chernick Apr 20 '17 at 19:10
  • @MichaelChernick OK, perhaps the rule of three is the wrong approach. How can I obtain a confidence interval for the number of `class1` instances out of the population of 1000, based on sampling? – thatWiseGuy Apr 20 '17 at 19:21
  • This has nothing to do with the rule of three. The problem is with your interpretation of the confidence interval. It does **not mean that with 95% confidence at least 943 out of 1000 are of class1**. – Michael R. Chernick Apr 20 '17 at 19:28
  • 1
    @Michael It is difficult to fathom your comments, because (a) the rule of three does apply and (b) the characterization of the one-sided confidence limit, albeit a little informal, is common and has a correct interpretation. – whuber Apr 20 '17 at 19:37
  • @Michael Can the rule of three be used to conclude with 95% confidence that fewer than 3 instances in 50 from the population will be of `class2`? The [Wikipedia](https://en.wikipedia.org/wiki/Rule_of_three_(statistics)) page seems to suggest this. – thatWiseGuy Apr 20 '17 at 19:39
  • @whuber I know that the rule of three applies as an approximation. I never said that it didn't. You may view the OP's statement as correct. I don't, because he uses a specific number of cases (943) out of a specific group of 1000 cases. I think that is a wrong interpretation and not just a little informal. – Michael R. Chernick Apr 20 '17 at 19:41
  • The Wikipedia page gives an informal statement in the same way that you did. But many statisticians would find that objectionable. I know that from my graduate school days my statistics professors insisted that coverage in repeated sampling was a necessary part of the definition of confidence interval. You will find it that way in many statistics texts. – Michael R. Chernick Apr 20 '17 at 19:47
  • @Michael When the samples all fall to one class or the other (and the sample size is reasonably large, so it is not a problem of the sample set being too small), is there a formal way to find a confidence interval for the number of `class1` instances in the population? – thatWiseGuy Apr 20 '17 at 19:55
  • Related: https://stats.stackexchange.com/questions/82720/confidence-interval-around-binomial-estimate-of-0-or-1/82724#82724 – kjetil b halvorsen Apr 20 '17 at 20:16
  • 1
    @Michael The rule of three is an approximation only in the sense that $3$ approximates $-\log(0.05)= 2.9957\ldots$, which is *excellent* precision given that the population size is determined to three significant figures anyway! I am concerned that your objections appear likely to mislead readers, including the OP, into thinking his reasoning and answers are incorrect, whereas exactly the opposite seems true. – whuber Apr 20 '17 at 20:46
  • The above link deals with methods for generating binomial confidence intervals. @whuber It is fine that the approximation is that good for 95% confidence. To thatWiseGuy: there is nothing that needs to be changed about the sample size. It is just that technically you can't say that a fixed number will be inside or outside the interval even when you qualify it by saying with 95% confidence. – Michael R. Chernick Apr 20 '17 at 20:52
  • @whuber Even the Clopper-Pearson method (called exact) doesn't give exact power or sample size required due to the discrete nature of the cumulative binomial. You may recall my paper in The American Statistician in 1982 on this subject. – Michael R. Chernick Apr 20 '17 at 20:56
  • I meant to say that my paper was published in 2002 and not 1982. – Michael R. Chernick Apr 20 '17 at 21:31
  • @Michael I don't see how your paper would be relevant, because this is not binomial sampling. Even if it were, a perfectly exact solution is available due to the special nature of the results: namely, 100% of the sample is all of one class--and that's what the Rule of Three is all about. – whuber Apr 20 '17 at 21:46
  • It just has to do with exact vs approximation and is a little bit off the subject. – Michael R. Chernick Apr 20 '17 at 21:58

1 Answer


The procedure described in the question is intuitive, clear, and accurate.

Problem Formulation

Formally, this is a hypergeometric sampling problem: in a population of $N=1000$ subjects, of which $K$ are in Class 1 and $N-K$ are in Class 2, a sample of size $n=50$ is taken without replacement and it is observed that all $n$ of them are in Class 1. A $95\%$ lower confidence limit $K_{0.95}$ for $K$ is the smallest value that is consistent with these data in the sense that if $K$ were any less than $K_{0.95}$, then the chance that every member of the sample is in Class 1 (as it turned out to be) would be less than $1 - 0.95 = 0.05 = \alpha$, which would be implausible.

Solution

This chance, as a function of the unknown $K$, is easy to compute. Because the sample of $n$ can be taken one at a time, and each time the values of both $K$ and $N$ decrease by $1$, it is equal to the product of the individual chances of observing a subject in Class 1:

$$P(K,n,N) = \frac{K}{N} \times \frac{K-1}{N-1} \times \cdots \times \frac{K-n+1}{N-n+1}.$$

This is a product of a sequence of decreasing fractions. Since $n\ll N$, the obvious bounds (based on replacing each term by the first fraction $K/N$ on the one hand and the first fraction that has been omitted, $(K-n)/(N-n)$, on the other hand) give an excellent approximation:

$$\left(\frac{K-n}{N-n}\right)^n \lt P(K,n,N) \lt \left(\frac{K}{N}\right)^n.$$
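As a quick numerical check of these bounds, here is a minimal Python sketch. The function name `prob_all_class1` and the trial value $K=943$ are illustrative choices of mine, not part of the derivation; the product is accumulated with exact rational arithmetic to sidestep rounding concerns.

```python
from fractions import Fraction


def prob_all_class1(K, n, N):
    """Chance that n draws without replacement from N subjects,
    K of which are in Class 1, all land in Class 1 (hypothetical helper)."""
    p = Fraction(1)
    for i in range(n):
        # i-th draw: K - i Class 1 subjects remain among N - i subjects.
        p *= Fraction(K - i, N - i)
    return float(p)


N, n, K = 1000, 50, 943  # K = 943 is only a trial value for illustration
p = prob_all_class1(K, n, N)
lower = ((K - n) / (N - n)) ** n  # every term replaced by (K-n)/(N-n)
upper = (K / N) ** n              # every term replaced by K/N

print(f"{lower:.5f} < {p:.5f} < {upper:.5f}")
```

The printed chain should show the exact probability sitting strictly between the two power bounds.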

The value of $K_{0.95}$ will therefore lie between the solutions $K$ to

$$n\log\left(\frac{K-n}{N-n}\right) \lt \log(\alpha) \lt n\log\left(\frac{K}{N}\right),$$

given by

$$n + (N-n)(1 - 3/n) \approx n + (N-n)(1 + \log(\alpha)/n) \gt K;\\K \gt N \exp(\log(\alpha)/n) \approx N \exp(-3/n).$$

(The appearance of $3$ as the approximation to $-\log(0.05)= 2.9957\ldots$ is the basis for this "Rule of Three".) With $N=1000$ and $n=50$ we have

$$941.764 \lt K_{0.95} \lt 943.082$$

(and these bounds are not appreciably changed by using $3$ instead of $-\log(0.05)$).
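To see how little the choice between $3$ and $-\log(0.05)$ matters, the closed-form expressions above can be evaluated directly. This is only a sketch; the variable names are mine and do not appear in the derivation.

```python
import math

N, n, alpha = 1000, 50, 0.05

# Lower end: solve (K/N)^n = alpha for K.
low_exact = N * math.exp(math.log(alpha) / n)
low_rule3 = N * math.exp(-3 / n)

# Upper end: solve ((K-n)/(N-n))^n = alpha for K, then linearize
# exp(x) ~ 1 + x as in the displayed formula.
high_exact = n + (N - n) * (1 + math.log(alpha) / n)
high_rule3 = n + (N - n) * (1 - 3 / n)

print(f"lower end: {low_exact:.3f} (log alpha) vs {low_rule3:.3f} (rule of three)")
print(f"upper end: {high_exact:.3f} (log alpha) vs {high_rule3:.3f} (rule of three)")
```

Each pair should agree to within about a tenth of a subject, which is immaterial when $K$ must be a whole number.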

The right hand value (upper bound) is the value proposed in the question. In fact, the precise solution is $K_{0.95} = 943$ because

$$P(943, 50, 1000) = 0.04924 \lt 0.05 \le 0.05199 = P(944, 50, 1000).$$
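These two probabilities can be cross-checked with the equivalent binomial-coefficient form $P(K,n,N) = \binom{K}{n}/\binom{N}{n}$. The sketch below assumes Python 3.8 or later for `math.comb`; the function name is hypothetical.

```python
from math import comb


def prob_all_class1_comb(K, n, N):
    """Hypergeometric chance that all n sampled subjects are in Class 1,
    written as a ratio of binomial coefficients (illustrative helper)."""
    return comb(K, n) / comb(N, n)


N, n, alpha = 1000, 50, 0.05
p943 = prob_all_class1_comb(943, n, N)
p944 = prob_all_class1_comb(944, n, N)

# The threshold alpha should fall between the two probabilities.
print(p943 < alpha <= p944)
```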

whuber
  • I agree with the way you form the endpoints but not with the interpretation of the interval. Perhaps it does help to clear up some of the OP's confusion. – Michael R. Chernick Apr 20 '17 at 22:01
  • 2
    @Michael If you truly disagree with the interpretation, then you are implicitly claiming the reasoning is incorrect from the outset! Are you trying to say that the chances are *not* equal to the values I have calculated? Please note that this approach begins not with any theorems, formulas, or approximations beyond the axioms of probability: it works directly from the definition of a confidence limit. – whuber Apr 20 '17 at 23:08
  • It is not the value. It is that a confidence interval always refers to the percentage of times the proportion is included **when the process is repeated many many times**. The words in bold print are what is missing in the interpretation, isn't that right? It refers to the various different intervals that can be generated when the process is repeated and not the specific interval that was observed, even though the interval was constructed correctly. – Michael R. Chernick Apr 20 '17 at 23:58
  • 2
    @michael, all I can suggest is that you review the definition of a CI. – whuber Apr 21 '17 at 03:30
  • [Here is a paper](https://www.jstor.org/stable/pdf/27919727.pdf?refreqid=excelsior%3A99799e01f5e85fc28adfa4ab395c7449) with an extended discussion. – kjetil b halvorsen Jun 19 '19 at 10:00