2

I want to characterize the simple algorithm for selecting, say, 20% of a discrete set when the total size of the set is not known:

foreach item in the set
   generate a random non-negative integer R
   if (R mod 100) < 20 then
      select item

Does that algorithm have a name? Is it the same as Algorithm $B\mathcal U$ in Binomial random variate generation? (The algorithm there has "generate $u \sim \mathcal U(0,1)$" instead of a random integer, and uses $\le$ instead of $\lt$. Algorithm cited here)

The probability of ending up with $k$ items after examining $n$ of them can be calculated using the binomial distribution as: $$f(n, k) = \binom{n}{k}p^k(1-p)^{n-k}$$ where $p=\frac{20}{100} = 0.2\quad$ in this example. However, plotting the resulting values for widely different $n$ results in non-comparable graphs. In the picture below, I plotted the values of $f(n,k)$, scaling the interval $[0, n]$ in abscissa to the unit interval $[0, 1]$, and multiplying $f(n, k)$ by $n/10$ in ordinate, so as to compensate for shrinking the abscissa. (That way, the "integral" of the curve appears to be preserved.) For high values of $n$ I compute the average of a few $k$'s. Since the results are rather high, I use a logarithmic scale.

Binomial distribution for p=0.2

I'd like to show the picture above in a wikipedia article. However, the explanation above smells of original research. Is there an accepted way to scale those functions so as to compare them? Are there references that justify them?

Note: cross posted to math.stackexchange, I didn't know of this site.

Shayan Shafiq
  • 633
  • 6
  • 17
Ale
  • 121
  • 3
  • 1
    "generate a random integer R" doesn't seem well defined. Do you mean random integer between 1 and 100? Or really any integer? – bdeonovic Sep 01 '21 at 20:07
  • 1
    Also generating random integer uniformly at random is basically the same as generating random float from continuous uniform distribution. The rest of the algorithm is just “throw a coin to decide” and doesn’t have special name. – Tim Sep 01 '21 at 21:37
  • 1
    This is essentially the binomial distribution, with $p=0.20$ and $n=|\text{set}|.$ – Adrian Keister Sep 01 '21 at 21:42
  • bdeonovic: I'll edit s/integer/natural number/. Tim: Yet Kachitvichyanukul and Schmeise call it "Algorithm BU". Adrian Keister: yes, exactly! – Ale Sep 02 '21 at 07:16
  • @Adrian Keister: do you want to post your comment(s) as an answer? [Better to have a short answer than no answer at all.](https://stats.meta.stackexchange.com/a/5326/) Anyone who has a better answer can post it. – kjetil b halvorsen Sep 02 '21 at 16:49
  • @kjetilbhalvorsen Sure, I guess. It'll probably get flagged as a poor quality answer, but here goes. – Adrian Keister Sep 02 '21 at 16:53

1 Answers1

2

This is essentially the binomial distribution, with $p=0.20$ and $n=|\text{set}|,$ the number of elements in the set.

Adrian Keister
  • 3,664
  • 5
  • 18
  • 35
  • Does that mean the scaled graphs renders the intrinsic nature correctly? – Ale Sep 04 '21 at 08:04
  • Well, there's nothing wrong with a logarithmic scale, as long as it's clearly called out, which you've done. For some plots, particularly those that cross many orders of magnitude, a logarithmic scale is much more illuminating than a linear one. – Adrian Keister Sep 04 '21 at 14:03