Proper way to test hypothesis of random selection?

Question

Suppose I have $N$ urns, each containing a different mix of red and green balls. A subject is to select $M$ balls at random, without replacement, from each urn, after which the red and green balls drawn from that urn are counted, giving one pair of counts per urn.

The null hypothesis is that the selection was random; the alternative is that the subject peeked and preferentially selected one color across all the urns.

Would summing the probabilities of all possible combinations of per-urn counts whose total equals the subject's grand total, or is more extreme, be a proper significance test? Or should each urn's selection be tested separately, with some kind of multiple-testing correction or meta-analysis used to arrive at an overall result? Or...?

Edit: A toy example for clarification:

Suppose there are only two urns. Urn 1 has 10 red & 10 green, while urn 2 has 12 red & 13 green. The subject then makes 5 draws without replacement from each urn and reports 2 red for urn 1 and 0 red for urn 2. The low red count raises suspicion that the subject peeked and picked green preferentially.

Using the first test idea, I take the possible combinations of per-urn red counts that give a total of 2 or fewer red: {{0, 0}, {1, 0}, {0, 1}, {2, 0}, {0, 2}, {1, 1}}. I calculate the product of the two per-urn probabilities for each combination and sum those products, arriving at a probability of ~0.04 of getting 2 or fewer total red if the draws were actually random.
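A minimal sketch of this first calculation, assuming each urn's red count is hypergeometric under random selection and the urns are independent. The helper names `hypergeom_pmf` and `total_red_pmf` are illustrative only, and only the standard library is used.

```python
from math import comb

def hypergeom_pmf(k, red, green, draws):
    """P(exactly k red) when drawing `draws` balls without replacement."""
    if k < 0 or k > red or draws - k < 0 or draws - k > green:
        return 0.0
    return comb(red, k) * comb(green, draws - k) / comb(red + green, draws)

def total_red_pmf(urns):
    """Distribution of the total red count over independent urns.

    `urns` is a list of (red, green, draws) tuples; entry i of the result
    is P(total red == i) under random selection.
    """
    dist = [1.0]  # with no urns, the total is 0 with certainty
    for red, green, draws in urns:
        new = [0.0] * (len(dist) + draws)
        for total, p_total in enumerate(dist):
            for k in range(draws + 1):
                new[total + k] += p_total * hypergeom_pmf(k, red, green, draws)
        dist = new
    return dist

urns = [(10, 10, 5), (12, 13, 5)]   # the two urns of the toy example
pmf = total_red_pmf(urns)
p_lower = sum(pmf[:3])              # P(total red <= 2)
print(round(p_lower, 4))
```

Convolving the per-urn distributions is the same as summing the probability products over the listed combinations, and it extends to more than two urns without enumerating combinations by hand.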

For the second, I'd calculate the probability of getting 2 or fewer red for urn 1 (0.5) and that of getting zero red for urn 2 (~0.024), then combine those p-values in a meta-analysis (say, using Fisher's method), getting ~0.07.
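A sketch of this second calculation, again using only the standard library. Fisher's statistic is $-2\sum_i \ln p_i$, referred to a chi-square distribution with $2k$ degrees of freedom for $k$ p-values; for even degrees of freedom the tail has a closed form, which is what `fisher_combined_p` (an illustrative name) uses. Note that the chi-square reference assumes continuous uniform p-values, so with discrete counts like these the combined p-value is only approximate and tends to be conservative.

```python
from math import comb, exp, log

def hypergeom_cdf(k, red, green, draws):
    """P(at most k red) when drawing `draws` balls without replacement."""
    total = comb(red + green, draws)
    upper = min(k, draws)
    return sum(comb(red, i) * comb(green, draws - i)
               for i in range(upper + 1)) / total

def fisher_combined_p(p_values):
    """Fisher's method: tail of chi-square with 2k df at -2 * sum(log p)."""
    half = -sum(log(p) for p in p_values)   # half of the test statistic
    term, tail = 1.0, 0.0
    for i in range(len(p_values)):          # sum_{i<k} half**i / i!
        tail += term
        term *= half / (i + 1)
    return exp(-half) * tail

p_urn1 = hypergeom_cdf(2, 10, 10, 5)   # 2 or fewer red from urn 1
p_urn2 = hypergeom_cdf(0, 12, 13, 5)   # zero red from urn 2
p_combined = fisher_combined_p([p_urn1, p_urn2])
print(round(p_urn1, 3), round(p_urn2, 3), round(p_combined, 3))
```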

Both methods seem reasonable, but arrive at opposing results for significance - the first is significant at the 0.05 level, the second is not.

Thoughts?

Given two valid statistical tests, usually the test with greater power would be preferred. However, this is complicated by the fact that the power depends on the specific alternative being considered. You may find it helpful to formulate a parametric family of specific alternatives (where the parameter(s) measure the expected extent of "peeking" or something along those lines). Then you could use theory and/or simulations to see which of the two tests is more powerful under various alternatives. — Brent Kerby, Feb 10 '16 at 06:47
I think we first have to really specify in detail what the situation is here. That is, we need a stochastic model for the outcome of the experiment and an unambiguous statement of the null and alternative hypothesis. For example: (1) can the subject peek in one urn and not in the others, or can we assume peeking is done for all or none of the urns, (2) when peeking does the subject always favour the same color, or can that be different for each urn? (3) Do all urns have the same number of green and reds? (4) ... — StijnDeVuyst, Feb 14 '16 at 14:20
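One way to act on the comments above is a small simulation under an explicitly assumed model of peeking. The model here is an assumption, not something stated in the question: balls are drawn one at a time, and each remaining green ball is `w` times as likely to be picked as each remaining red ball, with the same `w` for every urn (`w = 1` is random selection). The names `biased_draw`, `power` and `w` are hypothetical, and both tests are one-sided toward few reds, using the two urns of the toy example.

```python
import random
from math import comb, exp, log

URNS = [(10, 10, 5), (12, 13, 5)]   # (red, green, draws) from the toy example
ALPHA = 0.05

def pmf(k, red, green, draws):
    return comb(red, k) * comb(green, draws - k) / comb(red + green, draws)

def cdf(k, red, green, draws):
    return sum(pmf(i, red, green, draws) for i in range(k + 1))

(R1, G1, D1), (R2, G2, D2) = URNS
OUTCOMES = [(a, b) for a in range(D1 + 1) for b in range(D2 + 1)]
PROB = {(a, b): pmf(a, R1, G1, D1) * pmf(b, R2, G2, D2) for a, b in OUTCOMES}

# Precompute both p-values for every possible pair of red counts.
P_SUM = {(a, b): sum(p for (x, y), p in PROB.items() if x + y <= a + b)
         for a, b in OUTCOMES}
P_FISHER = {}
for a, b in OUTCOMES:
    half = -(log(cdf(a, R1, G1, D1)) + log(cdf(b, R2, G2, D2)))
    P_FISHER[(a, b)] = exp(-half) * (1.0 + half)   # chi-square tail, 4 df

def biased_draw(red, green, draws, w, rng):
    """Assumed peeking model: each green ball has weight w, each red weight 1."""
    reds = 0
    for _ in range(draws):
        if rng.random() < red / (red + w * green):
            red -= 1
            reds += 1
        else:
            green -= 1
    return reds

def power(w, n_sims=4000, seed=0):
    """Estimated rejection rates (sum test, Fisher) at level ALPHA."""
    rng = random.Random(seed)
    hits_sum = hits_fisher = 0
    for _ in range(n_sims):
        counts = tuple(biased_draw(r, g, d, w, rng) for r, g, d in URNS)
        hits_sum += P_SUM[counts] <= ALPHA
        hits_fisher += P_FISHER[counts] <= ALPHA
    return hits_sum / n_sims, hits_fisher / n_sims

for w in (1.0, 2.0, 4.0):
    print(w, power(w))
```

Different assumptions (peeking in only some urns, or a different favoured color per urn) would need a different `biased_draw`, and the ranking of the two tests could change with them.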

score 2 · Answer 1 · answered Feb 13 '16 at 10:19

Summing the probabilities of all outcomes that are as extreme or more extreme is perfectly adequate here, but there are a few things to keep in mind. First, you have to be careful about what you mean by "as or more extreme". If you have a particular reason to believe a cheater would prefer picking green over red, then the probability that two or fewer red would be chosen by chance may suffice. Often, though, you will want a two-sided p-value: what is the probability that two or fewer red, or two or fewer green, would be picked by chance? (Here that would give a p-value greater than 5%.)
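As a rough check of the two-sided figure for the toy example, here is a sketch that takes "two-sided" to mean exactly what is described above: the probability of 2 or fewer red in total plus the probability of 2 or fewer green in total. Other definitions of a two-sided p-value exist, and `count_pmf` and `convolve` are illustrative helper names.

```python
from math import comb

def count_pmf(red, green, draws):
    """Hypergeometric distribution of the red count for one urn."""
    total = comb(red + green, draws)
    return [comb(red, k) * comb(green, draws - k) / total
            for k in range(draws + 1)]

def convolve(p, q):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

total_red = convolve(count_pmf(10, 10, 5), count_pmf(12, 13, 5))
observed = 2
n_draws = len(total_red) - 1                       # 10 balls drawn in total

p_few_red = sum(total_red[:observed + 1])          # total red <= 2
p_few_green = sum(total_red[n_draws - observed:])  # total green <= 2
p_two_sided = p_few_red + p_few_green
print(round(p_few_red, 4), round(p_few_green, 4), round(p_two_sided, 4))
```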

Another important point, as djma mentioned: the 0.05 threshold is arbitrary and has created some serious problems for the reproducibility of scientific experiments. If you set 0.05 as the bar, you will reject on average one in every twenty true null hypotheses! So you should really set your threshold for significance based on a number of factors, particularly:

How bad would it be to reject a true null hypothesis? (In your toy model, this would be considering the consequences of accusing someone falsely of cheating)
How bad would it be to accept a false null hypothesis? (In your toy model, this would be the consequences of failing to catch a cheater)
Prior probabilities: a priori, how likely do you consider it is that the null hypothesis is false?
Are you subject to the look-elsewhere effect? (In your toy model, if you are a casino subjecting many people to this test, a lenient threshold such as 0.05 is bound to produce many false accusations of cheating, whereas a lenient threshold may be more reasonable if this is a one-off.)
Many more points I'm sure I'm not thinking of.
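To put rough numbers on the look-elsewhere point, here is a small sketch. It assumes every subject is honest, the tests are independent, and each test falsely rejects with probability exactly `alpha` (for a discrete test like this one, `alpha` is an upper bound). The Bonferroni threshold is a standard correction brought in here for illustration; it is not something the question or this answer prescribes.

```python
def prob_any_false_accusation(alpha, n_tests):
    """P(at least one false rejection) among n independent honest subjects."""
    return 1.0 - (1.0 - alpha) ** n_tests

def bonferroni_threshold(family_alpha, n_tests):
    """Per-test threshold keeping the family-wise error rate <= family_alpha."""
    return family_alpha / n_tests

for n in (1, 20, 100):
    print(n,
          round(prob_any_false_accusation(0.05, n), 3),   # chance of >= 1 false accusation
          round(0.05 * n, 2),                             # expected false accusations
          bonferroni_threshold(0.05, n))
```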

score 0 · Answer 2 · answered Feb 12 '16 at 23:14

In your toy example, the first method is more appropriate, since you can calculate the exact p-value without any approximations. Meta-analysis methods like Fisher's discard information about the problem (they use only the per-urn p-values, not the joint distribution of the counts) and will therefore generally be the less powerful test here.

Adding to Brent's comment, you should choose statistical test T1 over T2 when T1's type I and type II error rates are both lower. Otherwise, it's a judgement call and domain-specific.
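For the toy example, the actual type I error rate of each test at a nominal 0.05 level can be computed exactly by enumerating all 36 possible pairs of red counts. This is a sketch under the same assumptions as the question (independent hypergeometric counts, one-sided toward few reds); `size_sum` and `size_fisher` are illustrative names.

```python
from math import comb, exp, log

URNS = [(10, 10, 5), (12, 13, 5)]   # (red, green, draws) from the toy example
ALPHA = 0.05

def pmf(k, red, green, draws):
    return comb(red, k) * comb(green, draws - k) / comb(red + green, draws)

def cdf(k, red, green, draws):
    return sum(pmf(i, red, green, draws) for i in range(k + 1))

(r1, g1, d1), (r2, g2, d2) = URNS
outcomes = [(a, b) for a in range(d1 + 1) for b in range(d2 + 1)]
prob = {(a, b): pmf(a, r1, g1, d1) * pmf(b, r2, g2, d2) for a, b in outcomes}

def sum_test_p(a, b):
    """P(total red <= a + b) under random selection."""
    return sum(p for (x, y), p in prob.items() if x + y <= a + b)

def fisher_p(a, b):
    """Fisher's method on the two lower-tail p-values (chi-square, 4 df)."""
    half = -(log(cdf(a, r1, g1, d1)) + log(cdf(b, r2, g2, d2)))
    return exp(-half) * (1.0 + half)

# Actual size: total null probability of the outcomes each test rejects.
size_sum = sum(p for (a, b), p in prob.items() if sum_test_p(a, b) <= ALPHA)
size_fisher = sum(p for (a, b), p in prob.items() if fisher_p(a, b) <= ALPHA)
print(round(size_sum, 4), round(size_fisher, 4))
```

Because the counts are discrete, neither test attains exactly 0.05. Comparing type II error rates additionally requires a specific alternative, as noted above.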

The 0.05 significance threshold is arbitrary and plagues scientific research; one shouldn't be too attached to it. The more implausible a rejection of the null hypothesis seems, the lower the p-value you should require before getting excited about the result. In the toy example, 0.04 and 0.07 aren't that different.
