Suppose I have $N$ urns, each containing its own mix of red and green balls. A subject is to draw $M$ balls at random, without replacement, from each urn, after which the numbers of red and green balls drawn are recorded for each urn.
The null hypothesis is that the selection was random; the alternative is that the subject peeked and preferentially selected one color across all the urns.
Would summing the probabilities of all possible combinations of per-urn counts whose total equals the subject's grand total, or is more extreme, be a proper significance test? Or should each urn's selection be tested that way, with some kind of multiple-testing correction or meta-analysis used to arrive at an overall result? Or something else?
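To make the first idea concrete, here is how I would write it down, assuming the suspicion is about too few red balls (a lower tail). The symbols $R_i$ and $G_i$ (red and green balls in urn $i$), $k_i$ (red balls drawn from urn $i$) and $K_{\text{obs}}$ (observed total red count) are my own notation, introduced only for this sketch:

$$
p \;=\; \sum_{\substack{(k_1,\dots,k_N)\\ k_1+\dots+k_N \,\le\, K_{\text{obs}}}} \;\prod_{i=1}^{N} \frac{\binom{R_i}{k_i}\binom{G_i}{M-k_i}}{\binom{R_i+G_i}{M}}
$$

Each factor is the hypergeometric probability of drawing $k_i$ red balls from urn $i$, and the product assumes the draws from different urns are independent.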
Edit: A toy example for clarification:
Suppose there are only two urns. Urn 1 has 10 red & 10 green, while urn 2 has 12 red & 13 green. The subject then makes 5 draws without replacement from each urn, and reports 2 red for urn 1 and 0 red for urn 2. The low red count raises the suspicion that the subject peeked and picked green preferentially.
Using the first test idea, I take the possible combinations of per-urn red counts that give a total of 2 or fewer red: {{0, 0}, {1, 0}, {0, 1}, {2, 0}, {0, 2}, {1, 1}}. For each I calculate the product of the two urn probabilities, then sum those products, arriving at a probability of ~0.04 of getting 2 or fewer total red if the draws were actually random.
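For anyone who wants to check the ~0.04 figure, the following is a minimal sketch of that enumeration using only the Python standard library. The function names (`hypergeom_pmf`, `pooled_lower_tail`) are hypothetical helpers of mine, not from any package, and the code assumes the urns are independent and that "more extreme" means a lower total red count.

```python
from itertools import product
from math import comb


def hypergeom_pmf(k, red, green, draws):
    """P(exactly k red) when drawing `draws` balls without replacement."""
    if k < 0 or k > red or draws - k < 0 or draws - k > green:
        return 0.0
    return comb(red, k) * comb(green, draws - k) / comb(red + green, draws)


def pooled_lower_tail(urns, draws, observed_total):
    """P(total red over all urns <= observed_total) under random draws.

    Enumerates every combination of per-urn red counts, which is fine
    for a toy example but grows quickly with the number of urns.
    """
    p = 0.0
    for counts in product(range(draws + 1), repeat=len(urns)):
        if sum(counts) <= observed_total:
            term = 1.0
            for k, (red, green) in zip(counts, urns):
                term *= hypergeom_pmf(k, red, green, draws)
            p += term
    return p


# Toy example: (red, green) per urn, 5 draws each, 2 + 0 = 2 red observed.
urns = [(10, 10), (12, 13)]
p_pooled = pooled_lower_tail(urns, draws=5, observed_total=2)
print(f"pooled lower-tail p-value: {p_pooled:.4f}")
```

This agrees with the ~0.04 quoted above.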
For the second, I'd calculate the probability of getting 2 or fewer red for urn 1 (0.5) and that of getting zero red for urn 2 (~0.024), then combine those p-values (say, using Fisher's method), getting ~0.07.
Both methods seem reasonable, but they give conflicting results for significance: the first is significant at the 0.05 level, the second is not.
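Here is a minimal sketch of that second calculation, again standard library only. The helper names (`hypergeom_cdf`, `fisher_combined`) are hypothetical. Fisher's statistic is $-2\sum_i \ln p_i$, referred to a chi-square distribution with $2N$ degrees of freedom; because the degrees of freedom are even, the tail probability has a closed form, so no statistics package is needed. Note that this treats the per-urn p-values as continuous and uniform under the null, which is only an approximation for discrete counts like these.

```python
from math import comb, exp, factorial, log


def hypergeom_cdf(k, red, green, draws):
    """P(at most k red) when drawing `draws` balls without replacement."""
    total = comb(red + green, draws)
    lo = max(0, draws - green)  # smallest possible red count
    hi = min(k, red, draws)     # largest red count included in the tail
    return sum(comb(red, j) * comb(green, draws - j)
               for j in range(lo, hi + 1)) / total


def fisher_combined(p_values):
    """Fisher's method: combine independent p-values into one.

    For even degrees of freedom 2n, the chi-square upper tail is
    exp(-x/2) * sum_{i<n} (x/2)^i / i!, used here directly.
    """
    half_stat = -sum(log(p) for p in p_values)  # equals (chi-square stat) / 2
    n = len(p_values)
    return exp(-half_stat) * sum(half_stat ** i / factorial(i)
                                 for i in range(n))


# Toy example: lower-tail p-value for each urn, then combine.
p_urn1 = hypergeom_cdf(2, red=10, green=10, draws=5)
p_urn2 = hypergeom_cdf(0, red=12, green=13, draws=5)
p_fisher = fisher_combined([p_urn1, p_urn2])
print(f"urn 1: {p_urn1:.3f}, urn 2: {p_urn2:.3f}, Fisher: {p_fisher:.4f}")
```

This agrees with the 0.5, ~0.024 and ~0.07 figures quoted above.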
Thoughts?