Imagine I put people into different groups based on a uniformly distributed random variable $y = f(x)$ (e.g. the microsecond component of their arrival time at a website). After a while, I observe how many people are in group 1, group 2, etc. Say I have 100 groups, one for each of 100 microsecond timestamp values.

If I sampled 10000 people in total, I would expect 100 in each group, following a uniform distribution. Now, what differences in group size should I expect by random chance alone?

My mind is drawing a blank here; I have been out of probability classes for too long! Thinking of $y$ as $U(1,100)$, I think my question relates to the variance of $z = y^{-1}$, i.e. $z \sim U^{-1}(1,100)$. This Wikipedia page describes the inverse uniform distribution (for the continuous case, assuming the lower bound is strictly positive). The variance given there is

$$ \frac{1}{a b} - \left( \frac{\ln(b) - \ln(a)}{b-a} \right)^2 $$

where $a = 1$ and $b = 100$ are the bounds.
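As a quick numerical sanity check of the formula above, here is a minimal Python sketch that evaluates it for $a = 1$, $b = 100$ and compares it with a Monte Carlo estimate of the variance of $1/y$. The function names `inverse_uniform_variance` and `simulated_variance`, the number of draws and the seed are illustrative assumptions, not something taken from the Wikipedia page.

```python
import math
import random


def inverse_uniform_variance(a: float, b: float) -> float:
    """Closed-form variance of 1/Y for Y ~ U(a, b), assuming 0 < a < b."""
    mean = (math.log(b) - math.log(a)) / (b - a)  # E[1/Y]
    return 1.0 / (a * b) - mean ** 2              # E[1/Y^2] - E[1/Y]^2


def simulated_variance(a: float, b: float, n_draws: int = 200_000, seed: int = 0) -> float:
    """Monte Carlo estimate of Var(1/Y); n_draws and seed are arbitrary choices."""
    rng = random.Random(seed)
    values = [1.0 / rng.uniform(a, b) for _ in range(n_draws)]
    mean = sum(values) / n_draws
    return sum((v - mean) ** 2 for v in values) / n_draws


if __name__ == "__main__":
    print("closed form:", inverse_uniform_variance(1, 100))
    print("simulation: ", simulated_variance(1, 100))
```

Note that this only checks the variance of $1/y$ itself; it does not yet say anything about how many people land in each group.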

Where my mind is blank is here:

  • Am I even right here? Since I want to know the 'to be expected' differences in the number of people in each group, I think I am asking for the variance of the pdf values of a uniform distribution, though I could have gotten this wrong.
  • How do I scale this back from frequencies to 'number of people'? Is this a simple multiplication by 10000, or am I overlooking something?
coffeinjunky
  • Could you please explain how the "EDF" in your title is connected to the "group size differences" in the question? – whuber Jan 17 '20 at 00:01
  • @whuber If I sample from a uniform, don't we call the realised frequencies the empirical distribution function? The pdf of a uniform is constant, but if I sample from it, I will get differences for different points on the support. That's at least my reasoning. If I got this wrong or a different terminology would be more suitable, please advise on what would be better. I'd appreciate it. – coffeinjunky Jan 17 '20 at 08:51
  • Also, @whuber - how do the multinomial and chi-squared fit in here? – coffeinjunky Jan 17 '20 at 10:33
  • When you binned the data you created a multinomial distribution. Your question is answered by the theory that underlies the chi-squared test. One explanation (probably not the best, but a complete and rigorous one) is in my answer at https://stats.stackexchange.com/a/17148/919. The EDF is usually understood as the *cumulative* empirical distribution, whereas your question focuses on the multinomial frequencies. – whuber Jan 17 '20 at 13:43
  • Thanks a lot @whuber. I corrected the EDF to frequency instead so that the title reflects my problem better. Would it make a difference that the binning happens naturally? As in, timestamps have a pre-specified number of digits for microseconds, which are used to label the groups. In that sense it is the discrete version of the uniform, even though I appreciate that my Wiki link above mainly uses the continuous one. I guess I am wondering when a sample from a uniform becomes a multinomial? – coffeinjunky Jan 17 '20 at 15:39
  • The issue about what the binning depends on is interesting, but here I believe "happens naturally" may be a little misleading, because your description suggests the binning could be figured out without looking at any data: it is a function of the *nature* of the data (namely, how they are recorded) more than anything else. If that's correct, you have no worries. A sample from *any* distribution is multinomial when you tally it in predefined bins. – whuber Jan 17 '20 at 17:34
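
The multinomial view described in the comments can be illustrated with a small simulation. This is only a sketch under the question's setup (10000 people, 100 equally likely groups); the function names, the seed and the use of Python's `random` module are illustrative assumptions. Under the multinomial model each group count is marginally Binomial($n$, $1/k$), so its variance is $n \cdot \frac{1}{k} \cdot (1 - \frac{1}{k}) = 10000 \cdot 0.01 \cdot 0.99 = 99$, i.e. a standard deviation of roughly 9.95 people around the expected 100.

```python
import math
import random
from collections import Counter


def simulate_group_counts(n_people: int = 10_000, n_groups: int = 100, seed: int = 0) -> list:
    """Assign each person to one of n_groups equally likely groups and tally them."""
    rng = random.Random(seed)
    tally = Counter(rng.randrange(n_groups) for _ in range(n_people))
    return [tally.get(g, 0) for g in range(n_groups)]


def theoretical_count_sd(n_people: int, n_groups: int) -> float:
    """Standard deviation of a single group count, which is Binomial(n, 1/k)."""
    p = 1.0 / n_groups
    return math.sqrt(n_people * p * (1.0 - p))


def chi_squared_statistic(counts: list) -> float:
    """Pearson statistic against equal expected counts (k - 1 degrees of freedom)."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)


if __name__ == "__main__":
    counts = simulate_group_counts()
    print("smallest / largest group:", min(counts), max(counts))
    print("theoretical sd per group:", theoretical_count_sd(10_000, 100))
    print("chi-squared statistic:   ", chi_squared_statistic(counts))
```

With these numbers, most groups should land within about two standard deviations of 100 (roughly 80 to 120), and the chi-squared statistic should be in the neighbourhood of its 99 degrees of freedom if the uniform assumption holds.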

0 Answers