
If I have a dataset of size $N$ and draw $n$ samples, of which $k$ are OK, how confident can I be that the dataset as a whole is OK?

To make it more concrete, suppose I have a dataset D1 with 5,000 values. I choose 20 of these and check each one, finding that all 20 are OK. I also have a dataset D2 with 5,000 values; I choose 20 of these and find that 18 of them are OK.

I want to answer the question "how good is the data?" based on the sample that I've taken. Perhaps a better question would be: what sample size do I need in order to get a confidence interval of $\pm$ some value around the proportion given by the sample? I'm not sure, though, what an interval would mean when every sampled value is correct. Say I take a sample of 20 from 5,000 and they're all correct: I can't then infer, with some degree of confidence, that the data has between 4,900 and 5,100 correct values, because the maximum is of course 5,000.
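
For the sample-size form of the question, one common textbook approach is the normal-approximation formula with a finite population correction. The sketch below is only an illustration under stated assumptions: a 95% confidence level, the worst-case proportion $p = 0.5$, and simple random sampling without replacement. The function name `required_sample_size` and its parameters are hypothetical, chosen for this example.

```python
import math
from statistics import NormalDist


def required_sample_size(population, margin, confidence=0.95, p=0.5):
    """Approximate sample size for a +/- margin on a proportion.

    Assumption: normal approximation with a finite population correction.
    p=0.5 is the most conservative guess for the true proportion.
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Shrink it to account for sampling from a finite population.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


# A +/- 5 percentage point margin on a dataset of 5,000 values.
print(required_sample_size(5000, 0.05))
```

The normal approximation behind this formula is unreliable when the true proportion is close to 0 or 1, which is exactly the all-correct case, so the result is best read as a rough planning figure.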

I feel that it's probably something binomial, but am unsure.
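
A minimal sketch of the binomial view, assuming the sample is small relative to the dataset (20 out of 5,000) so that the finite population can be ignored. It uses the Wilson score interval, which stays inside $[0, 1]$ even when every sampled value is OK. This is one possible choice of interval, not the only one, and the function name `wilson_interval` is hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(k, n, confidence=0.95):
    """Wilson score interval for a binomial proportion k/n.

    Assumption: samples behave like independent Bernoulli trials,
    i.e. no finite population correction is applied.
    """
    if n <= 0 or not 0 <= k <= n:
        raise ValueError("need n > 0 and 0 <= k <= n")
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to guard against floating-point overshoot at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


print(wilson_interval(20, 20))  # D1: all 20 sampled values OK
print(wilson_interval(18, 20))  # D2: 18 of 20 sampled values OK
```

At 95% confidence this gives roughly (0.84, 1.00) for D1 and (0.70, 0.97) for D2, so the upper limit never exceeds a proportion of 1, i.e. 5,000 correct values.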

Stephan Kolassa
baxx
  • The top three hits for [a search for the tags "confidence-interval" and "binomial-distribution", ordered by votes](https://stats.stackexchange.com/questions/tagged/confidence-interval%2bbinomial-distribution?tab=Votes) will likely give you all you need. [Confidence interval for Bernoulli sampling](https://stats.stackexchange.com/q/4756/1352) gives you CIs for the proportion of "bad" values in your population. [Confidence interval around binomial estimate of 0 or 1](https://stats.stackexchange.com/q/82720/1352) examines the case where none of your samples were "bad". – Stephan Kolassa Jan 12 '22 at 12:01
  • In addition to Stephan's links, see also [here](https://stats.stackexchange.com/questions/540702/what-is-the-finite-population-correction-for-the-wilson-score-interval-for-a). – COOLSerdash Jan 12 '22 at 12:05

0 Answers