I have a data source that lets me pull big sets of numbers (~1e12) with an unknown distribution. Let's define a population as Mostly distinct if more than a fraction MD of its values are distinct numbers.
For each population, I need to decide whether it is Mostly distinct or not. How can I estimate, from a sample, the probability that the population is Mostly distinct? How does the confidence interval change with the sample size?
For example:
Population size: 1000
MD = 0.99 (99%)
sample: [1,2,3,4,5,6,7,8,9,9] (size=10, 9 distinct values)
How can I decide whether the original population is Mostly distinct, i.e. whether it has more than 990 (1000*MD) distinct values? What is the confidence interval for that decision?
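To make the setup concrete, here is a minimal Python sketch of the example above. It is only an illustration, not a proposed estimator: the helper names (`distinct_fraction`, `make_population`, `sample_fraction_spread`) are hypothetical, and the simulated population (990 distinct values, with the remaining 10 entries being duplicates chosen uniformly at random) is an assumption, since the real distribution is unknown. It shows how the fraction of distinct values seen in a sample drawn without replacement spreads out for different sample sizes.

```python
import random


def distinct_fraction(values):
    """Fraction of entries in `values` that are distinct."""
    return len(set(values)) / len(values)


def make_population(size, n_distinct, rng):
    """Hypothetical population: `n_distinct` different values, padded
    with duplicates of randomly chosen existing values (an assumption)."""
    base = list(range(n_distinct))
    extras = [rng.choice(base) for _ in range(size - n_distinct)]
    return base + extras


def sample_fraction_spread(population, sample_size, trials, rng):
    """Empirical 2.5% and 97.5% quantiles of the sample distinct fraction
    over repeated samples drawn without replacement."""
    fracs = sorted(
        distinct_fraction(rng.sample(population, sample_size))
        for _ in range(trials)
    )
    lo = fracs[int(0.025 * (trials - 1))]
    hi = fracs[int(0.975 * (trials - 1))]
    return lo, hi


rng = random.Random(0)  # fixed seed so the run is repeatable

# The sample from the question: 9 distinct values out of 10.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 9]
print(distinct_fraction(sample))  # 0.9

# Simulated population sitting exactly at the threshold 1000 * MD = 990.
population = make_population(1000, 990, rng)
for n in (10, 100, 500, 1000):
    print(n, sample_fraction_spread(population, n, 500, rng))
```

With this simulated population, small samples almost never contain a duplicate, so the distinct fraction of the sample is not a direct estimate of the distinct fraction of the population. That gap is the core of what I am asking about.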
I've seen the following questions, which are close to but not exactly my problem; in addition, most of them are unanswered:
Estimate number of unique items by number of duplicates in a sample
Estimating population size from the frequency of sampled duplicates and uniques
Estimating Unique Population Sizes