2

Suppose I take 2 sets ($A$ and $B$) of 1000 random items out the same distribution; I also suppose that all items are different. I then create a new set $C$ which is the union of $A$ and $B$, since all items are different it has 2000 elements. If $C$ is sorted (ascending or descending), should I roughly end up with a sequence: $a_0,b_1,a_2,b_3,a_4,b_5,...a_i,b_j,...b_{2000}$ ($a_i$ is an element of $A$ and $b_j$ is an element of $B$?

Or alternatively should the number of runs (defined by Wald-Wolfowitz) be almost equal to number of elements in $C$? Here a run is a follow up of elements from the same set. For example:

  • $a,b,a,b$ has 4 runs
  • $a,a,a,b,b$ has 2 runs
  • $a,b,b,b,a,a$ has 3 runs

If this would not be case, what's the reason behind it?

To give a more practical setting, I also implemented this now and asked it on stackoverflow.

warreee
  • 123
  • 3
  • Some or same? Either way there is not information to determine the sequence. – Michael R. Chernick Apr 26 '17 at 15:15
  • The same distribution, I changed it. – warreee Apr 26 '17 at 15:28
  • 1
    This seems clear enough to me. I'm voting to leave open. – gung - Reinstate Monica Apr 26 '17 at 16:15
  • @gung Do you know what the other eliminates in the sets are either before or after they are merged? – Michael R. Chernick Apr 26 '17 at 17:29
  • @MichaelChernick I'm not sure what you mean, but when the sets are merged, the union is taken and no elements are eliminated. – warreee Apr 26 '17 at 20:41
  • I think the question has changed considerably. Is a the only element in A and b the only element in B? If so the union of the two sets is just {a, b}. I don't think the Wald-Wolfowitz run test came up in the original question. I am still puzzled as to how the elements are randomly drawn. Are you drawing them out of C 1000 times? If so is this done without replacement? I still don't see how the question is clear and I don't see what interpretation is being used by those who gave answers. – Michael R. Chernick Apr 26 '17 at 20:58
  • I made the confusion around the union clear. The elements are drawn like you would do with python: `A = np.random.chisquare(1, 1000)` and `B = np.random.chisquare(1, 1000)`. – warreee Apr 26 '17 at 21:54
  • This is the situation analyzed in the first page of Part I of A. M. Mood, *The Distribution Theory of Runs* (1940): https://projecteuclid.org/euclid.aoms/1177731825. – whuber Apr 26 '17 at 22:22
  • Consider this thought experiment: watch the sequence as it is created. At any stage, roughly half the sequence consists of $a$'s and half of $b$'s. Until the end, there is about a 1/2 chance that the next element in the sequence will be an $a$ and a 1/2 chance it will be a $b$, and there is a 1/2 chance the *current* element is an $a$ and 1/2 chance it is a $b$. Therefore there is a 1/2 chance that the next element will be the same as the current element--continuing a run--and a 1/2 chance it will be different, creating a new run. Thus the expected number of runs is close to $2000/2=1000$. – whuber Apr 27 '17 at 13:43

2 Answers2

1

Cool question! If we generalize your framework and let $n$ represent the number of items in each set, then we can make some headway toward finding the solution. In your case $n = 1000$. But what if we let $n = 1$? There are always ${2n} \choose {n}$ total arrangements so in the simplest case where $n=1$, there are just two possibilities. Both have 2 runs. Now, consider $n=2$. There are ${2n} \choose {n}$ $= 6$ arrangements and the average number of runs is exactly 3. Going further, $n=3$ leads to 20 arrangements and an average of 4 runs. At this point, it seems unlikely that the answer will ever differ from $n+1$ as the average number of runs. This can probably be proved by induction pretty easily.

jjet
  • 1,187
  • 7
  • 12
  • your answer actually says I'm wrong because it would mean I end up on average with a,a,b,b,a,a,b,b ... because C has 2000 elements in it and not 1000? – warreee Apr 26 '17 at 21:35
  • I'm not sure I understand what you're asking. A and B both have 1000 elements. C is a concatenation of the two which is subsequently permuted randomly. Thus, C has 2000 elements. You're original guess of an average number of runs of 1000 was wrong...but just barely. The answer is actually 1001. – jjet Apr 27 '17 at 20:48
0

By comparing your problem to the Mann-Whitney test I guess you are correct, in the sense that the run lengths should not be much different from one. This test is based on summing the ranks of each set (a or b in your case) and then looking at the difference.

Leo Martins
  • 236
  • 1
  • 4