Fair Coin Testing (combine the results or not)

Question

Suppose you saw your cousin flip a coin 20 times, getting 12 heads and 8 tails. You then watch your cousin flip the coin 20 more times and want to test whether the coin is fair.

Let's consider the following two cases:

1) the cousin flipped the coin 20 times (without caring how many heads or tails came up)

2) the cousin flipped the coin UNTIL he saw at least 4 more heads than tails or 4 more tails than heads, and that happened to take 20 flips.

In each of these two cases, how would you test it? Would you combine the results from the next 20 coin flips that your cousin does or ignore the previous flips and just consider the next 20 flips?

Any reference would be greatly appreciated.

This sounds like a Silicon Valley interview question. In case (1), the number of heads (or tails) would follow the Binomial(n=20, p) distribution. If you use the Binomial p.m.f. with k = 12 and solve for p, shouldn't you expect to end up with ~0.5? (not sure how that would prove anything though unless n was much bigger). Case (2) would be the same approach using the negative Binomial p.m.f. (There's also a chance that I'm completely off). — Digio, Jul 11 '17 at 22:57
Was it seeing the outcome of the first set of flips that made you decide to test for fairness? The wording sort of suggests it, and knowing whether that's the case is important to a good answer. — Glen_b, Jul 12 '17 at 02:43
@Glen_b I didn't even think that would be important. If so, I would like to know how that affects the hypothesis testing in each case. — user98235, Jul 12 '17 at 06:39
Converted comments to an answer (at least it answers part of the question if the premise holds) — Glen_b, Jul 12 '17 at 08:19
@Digio I believe for case 2 the PMF would be a hitting-time distribution rather than a negative-binomial. See my answer. — GeoMatt22, Jul 14 '17 at 00:44

Glen_b · Accepted Answer · 2018-02-02T08:21:14.803

[This doesn't address the two numbered cases directly; it deals with their premise, by answering the follow-up question in the body text, so it is perhaps only a partial answer. Anyone answering cases 1 and 2 would presumably have to assume that, in spite of the wording, the desire to test did not come from observing the first set of flips. As I see it, the setup and the numbered cases focus on the details of estimating a probability while ignoring the larger issue of severe bias, which makes the apparent distinction between the two cases largely beside the point.]

In short, I focus here on "Would you combine the results from the next 20 coin flips that your cousin does or ignore the previous flips and just consider the next 20 flips?" and discuss it in detail as requested -- indeed, it's easily the most critical part here, and one that's very widely misunderstood, occasionally even among statisticians. All those "my third cousin's birthday is the same day as my girlfriend's, and they were born in the same year in the same town, what are the chances??" questions are examples of this sort of problem, and it also affects a lot of published work (that is, work where the observed data influenced the hypothesis tested).

If the desire to test the hypothesis was created by seeing the outcome of the first set of flips (and the wording of the question suggests it was), you shouldn't include those flips in your test. That's a data-generated hypothesis.

Otherwise you bias your p-values downward; you'll be much more likely to reject nulls than you should be when there's nothing going on.

Let's take a slightly more extreme case to make the point more clearly.

Imagine 100,000 people all toss (fair) coins 20 times. The ones that get very high (18+) or low numbers (0-2) of heads (we expect around 20 of each) decide to test those coins and the ones that got middling numbers do not.

The ones that do test combine their first 20 tosses with another 20 tosses and test at the 5% level. What's the probability that they reject the null?

(NB if the test works as it should, given they have a fair coin in this setup, it should be 5%... but it's way bigger.) That's what testing hypotheses suggested by the data does.

I just did it in simulation and got 42 people (who decide to test), and of those 32 of them went on to reject the null after combining their data. This is when the coin is fair! (larger simulations give similar results)

Test on the second set only (i.e. the data collected after the first set made you want to test the hypothesis) and you're okay.
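The thought experiment above can be sketched in Python. This is a minimal sketch, not the code behind the numbers quoted above: it assumes the test is an exact two-sided binomial test on the pooled 40 flips (the answer does not say which test was used), and the function names `two_sided_p`, `simulate` and `exact_rejection_rate` are made up for illustration.

```python
import random
from math import comb


def two_sided_p(k, n):
    """Exact two-sided binomial p-value for k heads in n flips of a fair coin."""
    dev = abs(2 * k - n)  # observed distance from n/2, kept in integer units
    tail = sum(comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= dev)
    return tail / 2**n


def simulate(n_people=100_000, n_flips=20, low=2, high=18, alpha=0.05, seed=0):
    """Return (number who decide to test, number who then reject the null)."""
    rng = random.Random(seed)

    def heads():
        # Heads in n_flips fair flips = number of set bits in n_flips random bits.
        return bin(rng.getrandbits(n_flips)).count("1")

    testers = rejections = 0
    for _ in range(n_people):
        first = heads()
        if low < first < high:
            continue  # middling result: this person never tests
        testers += 1
        combined = first + heads()  # pool the first set with n_flips new flips
        if two_sided_p(combined, 2 * n_flips) <= alpha:
            rejections += 1
    return testers, rejections


def exact_rejection_rate(n_flips=20, low=2, high=18, alpha=0.05):
    """P(reject | decided to test) for a fair coin, computed without simulation."""
    total = 2**n_flips
    p_select = p_both = 0.0
    for first in range(n_flips + 1):
        if low < first < high:
            continue
        w = comb(n_flips, first) / total  # P(first set has this many heads)
        p_select += w
        p_reject = sum(
            comb(n_flips, second) / total
            for second in range(n_flips + 1)
            if two_sided_p(first + second, 2 * n_flips) <= alpha
        )
        p_both += w * p_reject
    return p_both / p_select


if __name__ == "__main__":
    testers, rejections = simulate()
    print(testers, rejections, exact_rejection_rate())
```

Under these assumptions the exact conditional rejection rate works out to roughly 76%, in line with the 32 out of 42 reported above, even though every coin is fair.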

(+1) Good point! I tried to explore your final suggestion further in my own answer ... hopefully it is coherent :) — GeoMatt22, Jul 14 '17 at 00:43

score 4 · Answer 2 · answered Jul 14 '17 at 00:42

The answer by Glen_b does a good job highlighting the possible dangers of combining the two data sets to test coin fairness (i.e. if performing the second experiment was contingent on the first one's results).

Here I consider how to test under the alternative experimental conditions, assuming we use only the second data set. That is, given the data of $N=20$ flips with $K=12$ heads, generated either by

1) fixing $N=20$ ahead of time
2) flipping until the difference between heads/tails is $m=4$ (fixed ahead of time)

how consistent is the data with a fair coin?

In the fixed $N$ case, the number of heads $k$ is binomially distributed

$$\Pr\big[k\big] = \mathrm{Binom}_{N,p}\big(\,k\,\big)$$

where $k=0,1,\ldots,N$. Under the null hypothesis of $p=\tfrac{1}{2}$, the expected number of heads is $\langle{k}\rangle=10$, and the exceedance probability of the data is $\Pr\big[|k-\langle{k}\rangle|\geq|K-\langle{k}\rangle|\big]=50\%$.

In the fixed $m$ case, the number of flips $n$ is now a random variable, known as the hitting time. Under the null hypothesis the heads/tails difference is an unbiased simple random walk, with hitting time distribution

$$\Pr\big[n\big] = \tfrac{m}{n}\,\mathrm{Binom}_{n,p}\big(\,\tfrac{n+m}{2}\,\big)$$

where $n=m,m+2,\ldots,\infty$. For a biased coin the hitting time will be shorter (since the boundaries $\pm{m}$ are symmetric). So we can measure consistency as $\Pr\big[n\leq{N}\big]=38\%$ under the null.

So the same data, produced under different experimental setups, give slightly different p-values.
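The two figures above can be checked numerically. The following is a minimal Python sketch with hypothetical function names: it evaluates the binomial exceedance probability and sums the quoted hitting-time formula at $p=\tfrac{1}{2}$. One caveat is an assumption of this sketch and not part of the answer above: the closed form is the classical first-passage law for a single barrier at $+m$, so the sketch also tracks the walk with both barriers $\pm{m}$ by dynamic programming, for comparison.

```python
from math import comb


def binom_two_sided_p(K, N):
    """Pr[|k - N/2| >= |K - N/2|] for k ~ Binomial(N, 1/2)."""
    dev = abs(2 * K - N)  # distance from N/2, kept in integer units
    return sum(comb(N, k) for k in range(N + 1) if abs(2 * k - N) >= dev) / 2**N


def first_passage_cdf(m, N):
    """Sum over n = m, m+2, ..., N of (m/n) * C(n, (n+m)/2) / 2^n.

    This is the quoted formula evaluated at p = 1/2 (single barrier at +m).
    """
    return sum(m / n * comb(n, (n + m) // 2) / 2**n for n in range(m, N + 1, 2))


def two_barrier_exit_cdf(m, N):
    """Pr[|heads - tails| reaches m within N fair flips], by dynamic programming.

    Assumption of this sketch: the stopping rule is two-sided (barriers at +m and -m).
    """
    # alive[d] = probability that |difference| = d and m has not been reached yet.
    alive = [0.0] * m
    alive[0] = 1.0
    for _ in range(N):
        nxt = [0.0] * m
        for d, pr in enumerate(alive):
            if pr == 0.0:
                continue
            # From 0 the absolute difference always moves to 1; otherwise +-1 w.p. 1/2.
            up = pr if d == 0 else pr / 2
            if d > 0:
                nxt[d - 1] += pr / 2
            if d + 1 < m:
                nxt[d + 1] += up  # if d + 1 == m the mass is absorbed instead
        alive = nxt
    return 1.0 - sum(alive)


if __name__ == "__main__":
    print(binom_two_sided_p(12, 20))
    print(first_passage_cdf(4, 20))
    print(two_barrier_exit_cdf(4, 20))
```

With $N=20$, $K=12$ and $m=4$, the first two functions evaluate to roughly 0.50 and 0.38, matching the figures above. The two-barrier computation comes out near 0.75, so which figure applies depends on whether the stopping rule is read as one-sided or two-sided.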

A few notes:

The second setup does not follow a negative binomial distribution (this would require fixing $m$ heads, rather than excess heads).
Probably the first and second data sets could be combined to compute a consistent Bayesian posterior for $p$. In the case of a fixed-$N$ second experiment, the number of heads and tails could then literally be combined. For the fixed-$m$ second experiment, things are more complicated (e.g. for $p\neq\frac{1}{2}$ the $+m$ vs. $-m$ cases would differ and must be averaged together to get the total likelihood).
