How much confidence can I have that data is correct, based on checking $n$ samples?

Question

If I have a dataset of size $N$, and choose $n$ samples, where $k$ of the samples are OK, how confident can I be that the dataset is OK?

To make it more concrete, suppose I have a dataset D1 with 5,000 values. I choose 20 of these and check each one, finding that all 20 are OK. I also have a dataset D2 with 5,000 values; I choose 20 of these and find that 18 of them are OK.

I want to answer the question of "how good is the data", based on the sample that I've taken. Perhaps a better question would be: what sample size do I need in order to get a confidence interval of a given width ($\pm$ some margin) around the proportion given by the sample? That said, I'm not sure what such an interval would mean when every sampled value is correct. Say I take a sample of 20 from 5,000 and they're all correct. I can't then infer, with some degree of confidence, that the data has between 4,900 and 5,100 correct values, because the maximum is of course 5,000.
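
To illustrate what an interval can look like when the sample is all correct, here is a minimal sketch of the Wilson score interval, one common choice for a binomial proportion (it is not the only one). It treats the 20 checks as independent draws and ignores the finite population of 5,000, which should matter little when the sample is this small relative to the dataset. The function name `wilson_interval` is made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(k: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for the proportion k/n (hypothetical helper)."""
    # Two-sided normal quantile, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = k / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


# The two datasets from the question: 20/20 OK and 18/20 OK.
for name, k in [("D1", 20), ("D2", 18)]:
    lo, hi = wilson_interval(k, 20)
    print(f"{name}: {k}/20 OK, 95% interval for the OK proportion: ({lo:.3f}, {hi:.3f})")
```

The interval is asymmetric and never leaves $[0, 1]$: for 20 out of 20 it runs from roughly 0.84 up to 1, so the "more than 5,000 correct" problem does not arise.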

I feel that it's probably something binomial, but am unsure.
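
As a rough sketch of the binomial idea for the all-OK case: if the true fraction of bad values is $p$, the chance of seeing no bad value in $n$ independent draws is $(1-p)^n$, so the largest $p$ still consistent with a clean sample at level $\alpha$ solves $(1-p)^n = \alpha$. Sampling without replacement from $N$ values gives the hypergeometric analogue. The helper names below are hypothetical, the bounds are one-sided, and simple random sampling is assumed.

```python
from math import ceil, comb, log


def max_bad_rate_binomial(n: int, alpha: float = 0.05) -> float:
    """Largest bad-value rate p with (1 - p)**n >= alpha, given n clean samples."""
    return 1 - alpha ** (1 / n)


def max_bad_count_hypergeometric(N: int, n: int, alpha: float = 0.05) -> int:
    """Largest number of bad values in a dataset of size N for which a clean
    sample of size n (drawn without replacement) still has probability >= alpha."""
    total = comb(N, n)
    bad = 0
    # P(no bad value in sample) = C(N - bad, n) / C(N, n), decreasing in `bad`.
    while bad + 1 <= N - n and comb(N - (bad + 1), n) / total >= alpha:
        bad += 1
    return bad


def clean_sample_size(max_rate: float, alpha: float = 0.05) -> int:
    """Smallest n such that n clean samples rule out a bad rate above max_rate
    (binomial approximation)."""
    return ceil(log(alpha) / log(1 - max_rate))


print(max_bad_rate_binomial(20))               # upper bound on the bad rate
print(max_bad_count_hypergeometric(5000, 20))  # upper bound on the bad count
print(clean_sample_size(0.01))                 # samples needed to bound the rate at 1%
```

With 20 clean samples the binomial bound on the bad rate is about 14%, close to the "rule of three" approximation $3/n = 15\%$, so a clean sample of 20 says fairly little about a dataset of 5,000.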

The top three hits for [a search for the tags "confidence-interval" and "binomial-distribution", ordered by votes](https://stats.stackexchange.com/questions/tagged/confidence-interval%2bbinomial-distribution?tab=Votes) will likely give you all you need. [Confidence interval for Bernoulli sampling](https://stats.stackexchange.com/q/4756/1352) gives you CIs for the proportion of "bad" values in your population. [Confidence interval around binomial estimate of 0 or 1](https://stats.stackexchange.com/q/82720/1352) examines the case where none of your samples were "bad". — Stephan Kolassa, Jan 12 '22 at 12:01
In addition to Stephan's links, see also [here](https://stats.stackexchange.com/questions/540702/what-is-the-finite-population-correction-for-the-wilson-score-interval-for-a). — COOLSerdash, Jan 12 '22 at 12:05
