We have a very large population (about 1e6 items) containing an unknown number of types of items. We draw a small sample (~100 items) and find that exactly one item is duplicated. The question is how to estimate the number of types of items (the diversity) in the (effectively infinite) population.

Moreover, the distribution of types of items might match an exponential distribution (with one item being the most frequent), with unknown parameters.

I know this is very little to work with.

My knee-jerk reflex would be to run a Monte Carlo simulation over a range of parameters and pick the maximum likelihood estimate.

However, do you think there is an analytical solution?
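
For concreteness, here is a minimal sketch of the Monte Carlo idea described above. It rests on assumptions that are not part of the problem statement: the type frequencies are taken to be proportional to exp(-decay * i) over a finite number of types, and the names `type_weights`, `prob_one_duplicate` and `grid_mle` are made up for illustration. It estimates, for each parameter pair on a grid, the probability that a sample of 100 contains exactly one duplicated item, and returns the pair with the highest simulated likelihood.

```python
import math
import random
from itertools import accumulate


def type_weights(num_types, decay):
    """Unnormalised frequencies proportional to exp(-decay * i).

    Assumed parametrisation; decay = 0 gives equally frequent types.
    """
    return [math.exp(-decay * i) for i in range(num_types)]


def prob_one_duplicate(num_types, decay, sample_size=100, trials=2000, rng=None):
    """Monte Carlo estimate of P(exactly one duplicated item | parameters)."""
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are reproducible
    cum = list(accumulate(type_weights(num_types, decay)))
    types = range(num_types)
    hits = 0
    for _ in range(trials):
        sample = rng.choices(types, cum_weights=cum, k=sample_size)
        # n draws with n - 1 distinct values means one type was seen twice
        # and every other sampled type exactly once.
        if len(set(sample)) == sample_size - 1:
            hits += 1
    return hits / trials


def grid_mle(type_grid, decay_grid, **kwargs):
    """Return (likelihood, num_types, decay) with the highest simulated likelihood."""
    return max(
        (prob_one_duplicate(k, d, **kwargs), k, d)
        for k in type_grid
        for d in decay_grid
    )
```

A call such as `grid_mle([1000, 5000, 20000], [0.0, 1e-4, 1e-3])` would scan a small grid. With a single observation and two free parameters, I would expect a ridge of parameter pairs with similar likelihood rather than a sharp maximum, so the result should be read as a rough indication at best.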

January
  • I'd suggest the term _distinct_ here. For example, with a sample of 100, 1 item is duplicated, 98 items occur once (are unique) and the count you want to report is 99, the number of distinct items. Oddly, the computing sense of unique meaning distinct (which perhaps goes back one way or another to the Unix command `uniq`) doesn't seem to be recorded by any major dictionary I've looked at. On your main question, there is a massive literature in ecological statistics on estimating the number of distinct species, the principles of which carry over to any set of distinct items. – Nick Cox Feb 04 '16 at 09:49
  • For example, terms like unique users/visitors appear to count different people, not the number of people who have used a site or program just once. – Nick Cox Feb 04 '16 at 09:52
