The Wikipedia examples say that a coin is unfair if it generates a sequence such as 1111111, and that a high alternation rate, such as 1010101010, would likewise point to an unfair coin.
Where is the fallacy in concluding that a coin is unfair on the grounds that the observed sequence, like any other specific sequence, is highly improbable?
What I mean is that this is normally "resolved" by pointing out that P(11111111) = P(00000000) = P(01010101) = P(10101010) = P(01010100) = P(any other specific sequence), which supposedly shows that the observed sequence is no evidence against fairness. But I read the same equality the other way: since every specific sequence is just as highly improbable as all ones, no coin could ever be judged fair.
Now we have that a fair coin cannot look fair. Where is the fallacy, and how does one apply the p-test properly?
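To make the premise concrete, here is a minimal sketch (assuming independent flips of a fair coin; the function name is just illustrative) confirming that every specific length-8 sequence has the same probability, 2^-8 ≈ 0.0039:

```python
# Under a fair coin, every flip contributes a factor of 1/2,
# so any specific length-n sequence has probability (1/2)**n.
def sequence_probability(seq: str, p_heads: float = 0.5) -> float:
    return p_heads ** seq.count("1") * (1 - p_heads) ** seq.count("0")

for seq in ["11111111", "00000000", "01010101", "10101010", "01010100"]:
    print(seq, sequence_probability(seq))  # every line prints 0.00390625 = 2**-8
```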
edit The hypothesis is that the coin is fair, and the statistic is the probability of the observed sequence, which I can compute under that hypothesis. For any sufficiently long sequence this probability is very small, and therefore fairness must be rejected. Where is the fallacy?
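To illustrate what I understand the usual p-value recipe to be (a sketch only; choosing the number of heads as the test statistic is my assumption, not something taken from the article): the p-value is the probability, under the null, of a statistic at least as extreme as the observed one, not the probability of the exact sequence. With that statistic, 11111111 is flagged while 01010101 is not, even though both exact sequences have probability 2^-8:

```python
from math import comb

def two_sided_p_value(heads: int, n: int) -> float:
    """P(number of heads at least as far from n/2 as observed) under H0: fair coin.
    Doubles the folded upper tail; valid here because Binomial(n, 0.5) is symmetric."""
    k = max(heads, n - heads)                               # fold around the expected n/2
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

print(two_sided_p_value(8, 8))  # 11111111: 0.0078..., small, so fairness is rejected
print(two_sided_p_value(4, 8))  # 01010101: 1.0, no evidence against fairness
```

A different statistic, for example a runs test, would instead flag 01010101 for its perfect alternation, which I take to be what the "high alternation rate" example is getting at.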
edit2 Why can nobody simply say that the pitfall is the one pointed out in the first Wikipedia example, namely that the p-criterion does not take the sample size into account? I can even trivialize the problem. Forget the series. Let's evaluate the probability of picking one particular item, item38, under the assumption of a uniform 0-100 distribution. Obviously it is 1%, which is low but can still happen by chance. The statistics, however, show that this item appears in 100% of the cases (1 time in 1 experiment). According to the p-level test this cannot be due to chance, yet the sample size is plainly insufficient to conclude anything. So the p-test must be complemented by a sample-size analysis, and it is a fallacy to forget this. Right?
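Here is a small sketch of that trivialized example (my assumptions: 100 equally likely items, matching the 1% figure, and independent draws; the helper names are just illustrative). It separates the probability that a pre-specified item is drawn every time from the probability that whatever item happened to come up first keeps recurring; for a single draw the latter is 1, i.e. a 100% observed frequency carries no evidence at all:

```python
# Assumption: 100 equally likely items, so any pre-specified item has probability 0.01 per draw.
P_ITEM = 0.01

def p_prespecified_every_time(n: int) -> float:
    """P(a pre-specified item, e.g. item38, is drawn in all n independent draws)."""
    return P_ITEM ** n

def p_posthoc_every_time(n: int) -> float:
    """P(whatever item the first draw produced is repeated in the remaining n - 1 draws).
    For n = 1 this is 1: some item always shows up '100% of the time' in a single draw."""
    return P_ITEM ** (n - 1)

for n in (1, 2, 5):
    print(n, p_prespecified_every_time(n), p_posthoc_every_time(n))
# n=1: 0.01  vs 1.0   -> a single draw proves nothing
# n=2: 1e-04 vs 0.01
# n=5: 1e-10 vs 1e-08 -> recurrence across draws is what becomes improbable
```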
A related question: which distribution describes how often item38 is picked if I draw multiple samples, and how do I take the "integral over the extreme cases"?
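Under the assumptions above (independent draws, 100 equally likely items), I believe the count of item38 in n draws is Binomial(n, 0.01), and the "integral over the extreme cases" becomes the tail sum of that pmf; a sketch:

```python
from math import comb

def binom_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def upper_tail_p(k_obs: int, n: int, p: float = 0.01) -> float:
    """P(count >= k_obs) under Binomial(n, p): the one-sided 'integral over the
    extreme cases' for the question 'does item38 appear too often?'."""
    return sum(binom_pmf(k, n, p) for k in range(k_obs, n + 1))

print(upper_tail_p(1, 1))     # 1 hit in 1 draw:  0.01
print(upper_tail_p(3, 100))   # 3 hits in 100:    ~0.079, unremarkable
print(upper_tail_p(10, 100))  # 10 hits in 100:   ~7.6e-08, strong evidence of non-uniformity
```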