4

I came across this: a categorical distribution with $K=10,000$ parameters (categories), and we take only few samples from this distribution, say $N=400$ (the point is $N < K$). Now, obviously, not all categories will show up in the sampled data, so let's call number of categories observed $x$, $x$ is a random variable, what do you call the distribution of $x$?

PS: I came across this while reading about Dirichlet process derivation.

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467
John Deterious
  • 235
  • 1
  • 6

1 Answers1

3

Let $X_1, \dotsc, X_N \sim \mathcal{Catdist}(K,\pi_1,\dotsc,\pi_K)$ (and interest is in case $N<K$). Let $Y_j=\sum_i \{X_i=j\}$ be the count of the number of times the value $j$ was observed. Now $(Y_1, \dotsc,Y_K)\sim\mathcal{multinom}(N,\pi_1,\dotsc,\pi_K)$.

Your question is about the random variable $B=\sum_{j=1}^K \{Y_j \ge 1\}$, I doubt there is a conventional name for this distribution. But in the special case that all $\pi_i=1/K$, this general problem is known as balls in boxes and something might be known. Have a look at this math SE post. There is also some similarity to occupancy problem in ecology.

There is also connection to the Coupon collector problem.

kjetil b halvorsen
  • 63,378
  • 26
  • 142
  • 467