
I have a data source that lets me pull large sets of numbers (~1e12 values) with an unknown distribution. Let's define a population as Mostly distinct if its number of distinct values exceeds MD times the population size.

For each population, I need to decide whether it is Mostly distinct. How can I estimate, from a sample, the probability that the population is Mostly distinct? How does the confidence interval change with the sample size?

For example:

Population size: 1000

MD = 0.99 (99%)

sample: [1,2,3,4,5,6,7,8,9,9] (size=10, 9 distinct values)

How can I decide whether the original population is Mostly distinct, i.e. has more than 990 (1000*MD) distinct values? What is the confidence interval for that decision?
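To make the setup concrete, here is a minimal simulation sketch of the example above. It does not answer the question; it only shows what samples look like under one hypothesized population. The helper names (`make_population`, `sample_distinct_fractions`, `percentile_interval`) are made up for illustration, and the way duplicates are spread over values is an assumption, since the real distribution is unknown.

```python
import random

def make_population(size, n_distinct, seed=0):
    """Build `size` numbers containing exactly `n_distinct` distinct values.

    Assumption: the duplicated items are spread uniformly at random over the
    distinct values. The real data may be far more skewed.
    """
    rng = random.Random(seed)
    base = list(range(n_distinct))  # every distinct value appears once
    extras = [rng.randrange(n_distinct) for _ in range(size - n_distinct)]
    return base + extras

def sample_distinct_fractions(population, sample_size, trials=2000, seed=1):
    """Draw repeated samples without replacement and return the sorted
    fraction of distinct values observed in each sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        fractions.append(len(set(sample)) / sample_size)
    return sorted(fractions)

def percentile_interval(sorted_values, level=0.95):
    """Empirical central interval covering roughly `level` of the values."""
    n = len(sorted_values)
    lo = sorted_values[int((1 - level) / 2 * n)]
    hi = sorted_values[min(n - 1, int((1 + level) / 2 * n))]
    return lo, hi

# Hypothesized population sitting exactly at the MD = 0.99 threshold.
population = make_population(1000, 990)
for n in (10, 100, 500):
    fractions = sample_distinct_fractions(population, n)
    print(n, percentile_interval(fractions))
```

Under this assumed population, a sample of size 10 will usually contain no duplicates at all, so very small samples carry little information about whether the population is above or below the threshold. That is the core of my difficulty.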

I've seen these questions, which are close to but not exactly my issue; most of them are also unanswered:

Estimate number of unique items by number of duplicates in a sample

Estimating population size from the frequency of sampled duplicates and uniques

Estimating Unique Population Sizes

kjetil b halvorsen
yossico
  • I believe the law of large numbers will be helpful here – yoav_aaa Nov 14 '19 at 14:32
  • I understand your populations are already stored in a database, so if a population is small there should be no need for sampling or estimation; just count the number of unique members exactly. But can you clarify: are your database(s) the complete populations? Are they all very large? Distributed? Are there restrictions on how you can do the sampling (maximal sample size, batch sizes, ...)? – kjetil b halvorsen Nov 14 '19 at 15:46
  • @kjetilbhalvorsen The populations can come from a variety of sources. A population can be huge (sometimes hundreds of billions of values), and for huge populations resource restrictions prevent me from processing them in full, which is why I need estimation. Each set is a complete instance of something. It is not distributed. – yossico Nov 14 '19 at 15:54

0 Answers