33

Suppose we have any kind of coin. Why should the relative frequency of getting a heads converge to any value at all?

One answer is that this is simply what we empirically observe to be the case, and I think this is a valid answer.

However, my question is, is there some simple set of principles/axioms that apply to coin tossing from which we can derive this fact?

Maximal Ideal
    Maybe you're asking about [ergodicity](https://en.wikipedia.org/wiki/Ergodic_process)? It's the assumption that *ensemble average* = *time average*. It's not something you can prove; it's an assumption you make about the problem. – user541686 Feb 10 '22 at 07:47
  • 3
    Note that convergence is not guaranteed. [Mandatory xkcd.](https://xkcd.com/221/) Thinking of convergence instead as a growing *probability* for certain classes of results "probably" leads to an answer. – Peter - Reinstate Monica Feb 10 '22 at 15:28
  • 1
    @Peter-ReinstateMonica: The linked xkcd procedure *does* converge (trivially). – Ben Feb 10 '22 at 20:01
  • @Ben Yes, but entirely by random! – Peter - Reinstate Monica Feb 10 '22 at 20:13
  • Can you explain how "… not necessarily fair…" helps, here? What difference does that make? – Robbie Goodwin Feb 11 '22 at 00:48
  • @RobbieGoodwin That sentence fragment wasn't meant to mean anything significant. All I meant is that I am interested in any two-outcome system with $p(H) = t$ and $p(T) = 1-t$ where $0\le t\le 1$. – Maximal Ideal Feb 11 '22 at 00:57
  • 1
    Thanks and could you Edit the Question to reflect that? … to include only what was meant to mean anything significant? – Robbie Goodwin Feb 11 '22 at 01:09
  • @Ben Here's a sequence you might like better. I think it doesn't converge. For the $i$th toss of the coin, return the second most significant bit of $i+2$. For example, the first few tosses are TH TTHH TTTTHHHH TTTTTTTTHHHHHHHH, etc. This oscillates between 1/4 and 1/2 of the tosses being H forever. – Daniel Wagner Feb 11 '22 at 15:51
  • @DanielWagner: Yes, that one is quite similar to the pathological sequence in my own answer (mine being a stochastic version of the same oscillating phenomenon)! – Ben Feb 11 '22 at 20:19

5 Answers

37

This is an excellent question, and it shows that you are thinking about important foundational matters in simple probability problems.

The convergence outcome follows from the condition of exchangeability. If the coin is tossed in a manner that is consistent from flip-to-flip, then one might reasonably assume that the resulting sequence of coin tosses is exchangeable, meaning that the probability of any particular sequence of outcomes does not depend on the order in which those outcomes occur. For example, the condition of exchangeability would say that the outcome $H \cdot H \cdot T \cdot T \cdot H \cdot T$ has the same probability as the outcome $H \cdot H \cdot H \cdot T \cdot T \cdot T$, and exchangeability of the sequence would mean that this is true for strings of any length which are permutations of each other. The assumption of exchangeability is the operational assumption that reflects the idea of "repeated trials" of an experiment --- it captures the idea that nothing is changing from trial-to-trial, such that sets of outcomes which are permutations of one another should have the same probability.
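
To make the permutation-invariance concrete, here is a tiny R sketch (my own illustrative numbers: a hypothetical mixture of two coins with heads-probabilities 0.3 and 0.8, chosen with equal probability). Under such an exchangeable model the probability of a sequence depends only on its counts of heads and tails, so any two sequences that are permutations of each other receive identical probabilities.

#Probability of a sequence under a hypothetical two-coin mixture (illustrative only)
seq.prob <- function(outcomes, probs = c(0.3, 0.8), weights = c(0.5, 0.5)) {
  h <- sum(outcomes == 'H')
  t <- sum(outcomes == 'T')
  sum(weights*probs^h*(1-probs)^t) }

seq.prob(c('H','H','T','T','H','T'))   #Same value...
seq.prob(c('H','H','H','T','T','T'))   #...as this permutation of it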

Now, if this assumption holds then the sequence of outcomes will be IID (conditional on the underlying distribution) with fixed probability for heads/tails which applies to all the flips. (This is due to a famous mathematical result called de Finetti's representation theorem; see related questions here and here.) The strong law of large numbers then kicks in to give you the convergence result ---i.e., the sample proportion of heads/tails converges to the (fixed) probability of heads/tails with probability one.
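
As a small demonstration (a sketch with arbitrary illustrative settings, not part of the formal argument), we can generate an exchangeable sequence via de Finetti's construction by drawing a latent heads-probability once and then flipping IID given that value; the running sample proportion of heads settles down to the drawn value.

#Sketch of de Finetti's construction (illustrative settings only)
set.seed(1)
n        <- 10^5
p.latent <- runif(1)                              #Latent heads-probability
flips    <- rbinom(n, size = 1, prob = p.latent)  #IID flips given the latent value
props    <- cumsum(flips)/(1:n)                   #Running sample proportion of heads
c(latent.prob = p.latent, final.prop = props[n])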


What if exchangeability doesn't hold? Can there be a lack of convergence? Although there are also weaker assumptions that can allow similar convergence results, if the underlying assumption of exchangeability does not hold ---i.e., if the probability of a sequence of coin-toss outcomes depends on their order--- then it is possible to get a situation where there is no convergence.

As an example of the latter, suppose that you have a way of tossing a coin that can bias it to one side or the other ---e.g., you start with a certain face of the coin upwards and you flip it in a way that gives a small and consistent number of rotations before landing on a soft surface (where it doesn't bounce). Suppose that this method is sufficiently effective that you can bias the coin 70-30 in favour of one side. (For reasons why it is difficult to bias a coin-flip in practice, see e.g., Gelman and Nolan 2002 and Diaconis, Holmes and Montgomery 2007.) Now, suppose you were to execute a sequence of coin tosses in such a way that you start off biasing your tosses towards heads, but each time the sample proportion of heads exceeds 60% you change to bias towards tails, and each time the sample proportion of tails exceeds 60% you change to bias towards heads. If you were to execute this method then you would obtain a sample proportion that "oscillates" endlessly between about 40-60% heads without ever converging to a fixed value. In this instance you can see that the assumption of exchangeability does not hold, since the order of outcomes gives information on your present flipping-method (which therefore affects the probability of a subsequent outcome).


Illustrating non-convergence for the biased-flipping mechanism: We can implement a computational simulation of the above flipping mechanism using the R code below. Here we create a function oscillating.flips that can implement that method for a given biasing probability, switching probability and starting side.

oscillating.flips <- function(n, big.prob, switch.prob, start.head = TRUE) {
  
  #Set vector of flip outcomes and sample proportions
  FLIPS <- rep('', n)
  PROPS <- rep(NA, n)
  
  #Set starting values
  VALS <- c('H', 'T')
  HEAD <- start.head
  
  #Execute the coin flips
  for (k in 1:n) {
    
    #Set probability and perform the coin flip
    PROB <- c(big.prob, 1-big.prob)
    if (!HEAD) { PROB <- rev(PROB) }
    FLIPS[k] <- sample(VALS, size = 1, prob = PROB)
    
    #Update sample proportion and execute switch (if triggered)
    if (k == 1) {
      PROPS[k] <- 1*(FLIPS[k] == 'H') 
    } else {
      PROPS[k] <- ((k-1)*PROPS[k-1] + (FLIPS[k] == 'H'))/k }
    if (PROPS[k] > switch.prob)   { HEAD <- FALSE }
    if (PROPS[k] < 1-switch.prob) { HEAD <- TRUE  } }
  
  #Return the flips
  data.frame(flip = 1:n, outcome = FLIPS, head.props = PROPS) }

We implement this function using the mechanism described above (70% weighting towards biased side, switching probability of 60%, and starting biased to heads) and we get $n=10^6$ simulated coin-flips for the problem, with a running output of the sample proportion of heads. We plot these sample proportions against the number of flips, with the latter shown on a logarithmic scale. As you can see from the plot, the sample proportion does not converge to any fixed value --- instead it oscillates between the switching probabilities as expected.

#Generate coin-flips
set.seed(187826487)
FLIPS <- oscillating.flips(10^6, big.prob = 0.7, switch.prob = 0.6, start.head = TRUE)

#Plot the resulting sample proportion of heads
library(ggplot2)
THEME  <- theme(plot.title    = element_text(hjust = 0.5, size = 14, face = 'bold'),
                plot.subtitle = element_text(hjust = 0.5, face = 'bold'))
FIGURE <- ggplot(aes(x = flip, y = head.props), data = FLIPS) +
          geom_point() + 
          geom_hline(yintercept = 0.5, linetype = 'dashed', colour = 'red') +
          scale_x_log10(breaks = scales::trans_breaks("log10", function(x) 10^x),
             labels = scales::trans_format("log10", scales::math_format(10^.x))) +
          scale_y_continuous(limits = c(0, 1)) +
          THEME + ggtitle('Example of Biased Coin-Flipping Mechanism') +
          labs(subtitle = '(Proportion of heads does not converge!) \n') +
          xlab('Number of Flips') + ylab('Sample Proportion of Heads')
FIGURE

[Figure: running sample proportion of heads against the number of flips (log scale); the proportion oscillates between roughly 0.4 and 0.6 and does not converge.]

Ben
  • 3
We can all agree that if there is an ind. probability $p$ of getting heads on each flip, that leads to convergence. The reverse is not true, so I'm not sure what point you're getting at here. Yes, some biases will get non-convergence, but some biases won't. Therefore, as far as I can tell, this numerical experiment does not provide an explanation as to why we should expect convergence without falling back on assuming the existence of $p$. –  Feb 09 '22 at 17:56
  • 3
    Because the question concerns *almost sure convergence,* it is tantamount to asking about the ideas underlying the *strong law,* not the weak law. That distinction doesn't seem to be evident in this post. It would be nice to see these issues highlighted and clearly explained. – whuber Feb 09 '22 at 18:06
  • @user37344: As explained in the answer, the IID model with fixed $p$ is based on the assumption of *exchangeability* of the coin-flips. The point here is that exchangeability is a more primitive operational assumption for the coin-flipping setup and it is the key testable assumption that leads to the convergence result. – Ben Feb 09 '22 at 18:09
  • @whuber: It's not really clear to me that the OP is asking about the strong law as opposed to the weak, since "convergence" can mean lots of different things in a probabilistic context. In any case, I've edited to be clear that SLLN holds. Here I've decided to focus on distinguishing the standard case from a case where neither law holds. – Ben Feb 09 '22 at 18:12
  • @Ben Well, you explain the IID bit. The fixed $p$ is not implied by exchangeability. So, if your goal is just to use a weaker set of assumptions to get convergence, then fine, I suppose that works. I just don't think that in the real world anyone uses exchangeability to predict coin-flipping outcomes. It's not that hard to make the $p$ assumption and then verify statistically that behavior is consistent with that stronger assumption. –  Feb 09 '22 at 18:14
  • 1
    I think you're mistaken --- fixed $p$ is implied by exchangeability of the sequence of coin-flips (by de Finetti's representation theorem). If you have a sequence of binary outcomes, and you have established that they are IID, then there *must* be a fixed probability of each binary outcome in each trial (or the 'ID' in IID is missing). – Ben Feb 09 '22 at 18:16
  • I agree that some interpretation of the question is necessary, but I think the SLLN is forced on us by the phrase "relative frequency of getting a heads:" this refers to a *single sequence* of flips, not a sequence of probability distributions. We need to show that almost all countable sequences of flips yield convergent relative frequencies. Indeed, a discussion of the fact that *not* all sequences converge (indeed, an uncountable number fail to converge) could be particularly insightful. – whuber Feb 09 '22 at 18:22
  • 2
    @whuber: I think that's a nice way to approach the problem. As a slight caveat on that interpretation, I'm not sure you necessarily have to suck the probabilistic interpretation out of the flips themselves --- you could also interpret the question by saying that you have a sequence of random variables and then examine whether or not this converges *in probability* to a constant. So my view is that you could also validly look at WLLN here, but yes, looking at SLLN is probably more insightful in this context. – Ben Feb 09 '22 at 18:29
  • Is the notion of a sequence of outcomes being IID *conditional on the underlying distribution* Bayesian? Or does it also make sense from a frequentist perspective? – Richard Hardy Feb 10 '22 at 18:21
  • 2
    @RichardHardy: Yeah, that is a distinction that is made in Bayesian statistics; I mention the conditioning because it leads to non-independence in the (marginal) predictive distribution (see [O'Neill 2009](https://www.jstor.org/stable/27919725) for discussion). In the classical ("frequentist") approach the distribution would be considered an "unknown constant", and since it is considered constant you can reasonably say that we are *always* implicitly conditioning on it. So in the latter context you wouldn't need to specify the conditioning (and people don't). – Ben Feb 10 '22 at 20:05
  • Small mistake: exchangeability does not imply independence, so the sentence "if this assumption holds then the sequence of outcomes will be IID" is wrong, I believe. – jochen Feb 11 '22 at 05:02
  • @jochen: I'm happy to be corrected, but my understanding is that an exchangeable sequence is IID conditional on its underlying distribution (per de Finetti's representation theorem). Am I missing something? – Ben Feb 11 '22 at 05:56
12

If you assume the coin tosses are independent of each other, and that the probability of obtaining heads is the same on every toss, then the convergence isn't an axiom; it follows from the strong law of large numbers.

EDIT: Just to answer some of the other things in your post, statistics is built upon probability theory, so if we empirically observe the frequency of heads converging to a specific value then that's fine, but we would like the empirical observations to agree with our probabilistic model.

And just to clear up a potential confusion before it arises, the strong law of large numbers isn't just related to coin tossing, and is an incredibly versatile and useful theorem (for example, under the same assumptions listed above, we could apply the same argument to dice throwing, and even continuous measurements!)
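
For instance, here is a small R sketch (my own illustrative example, not tied to any particular application) showing the same behaviour for fair-die throws and for a continuous measurement modelled as exponential waiting times:

#Sketch: sample means approaching expected values (illustrative only)
set.seed(2)
n     <- 10^5
dice  <- sample(1:6, n, replace = TRUE)     #Independent fair-die throws
waits <- rexp(n, rate = 2)                  #Independent continuous measurements
c(dice.mean = mean(dice), dice.expected = 3.5)    #Expected value of a fair die
c(wait.mean = mean(waits), wait.expected = 0.5)   #Expected value of Exp(2)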

jacobe
4

I want to second the answer by jacobe, but also add a bit of detail.

Assume that there is some probability $p$ of a coin toss being a head. Assign the outcome heads a score of $1$ and tails a score of $0$. Note that the average score from a large number of coin tosses is also the relative frequency of getting heads.

The "expected" score from any given coin toss is

\begin{equation} p\cdot 1+(1-p)\cdot 0 \end{equation}

Of course, this simplifies to

\begin{equation} p \end{equation}

due to our scoring system. The law of large numbers tells us that, as we flip the coin more times, we should eventually see the average approaching this expectation value (assuming all coin tosses are independent of each other).

Now, if your question is: why do we even believe there is a probability $p$ of getting heads, then I'm not sure there is a justification. At least, I haven't heard it. This comes back to what jacobe said about probability vs. statistics. The existence of $p$ and independence of events are assumptions for our model. If the statistical results seem consistent, that's a good sign for the model.
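
As a quick numerical check (a sketch with an arbitrarily chosen $p$, purely for illustration), the average score creeps towards $p$ as the number of flips grows:

#Sketch: average score approaching p (p chosen arbitrarily for illustration)
set.seed(3)
p      <- 0.37
scores <- rbinom(10^6, size = 1, prob = p)    #1 = heads, 0 = tails
sapply(c(10^2, 10^4, 10^6), function(n) mean(scores[1:n]))   #Approaches p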

  • 2
    The problem with this answer at present is that it does not actually get to the issue ---i.e., the assumptions required for the setup and use of LLN. The answer assumes the required conditions for convergence, but does not disclose what those assumptions are. For example, why should there be a single probability $p$ that applies to all the coin tosses? If we are talking about possibilities of non-convergence of the sample proportion, doesn't that assumption warrant some discussion/justification? – Ben Feb 09 '22 at 17:32
@Ben It's nice that you put all that work into creating a more complicated situation in your answer, but take a moment to actually consider the case the OP talks about. You're about to do a coin-flipping experiment. You're trying to guess whether the long-term average will converge and to what. Are you really telling me you're not going to start with the premise that an individual coin flip has some fixed probability $p$? I explicitly stated that it was an assumption which continues to look "good" so long as actual results from data are consistent. –  Feb 09 '22 at 17:41
  • The OP asks why convergence would occur, so he is interested in understanding the conditions underlying that result. So no, I wouldn't use the premise that there is a fixed probability $p$. – Ben Feb 09 '22 at 17:43
  • @Ben Your results also make assumptions about flipping probability. You add in biases, but the assumptions are not inherently weaker than mine. They are inherently less valid in the real world. It's just a different set up. –  Feb 09 '22 at 17:44
  • The problem here is that your answer does not explain where the assumption of fixed $p$ actually comes from in operational terms --- it is just asserted as an assumption for which no justification is offered. That is not terribly helpful for a person who wants to know why convergence should occur at all. – Ben Feb 09 '22 at 17:48
  • 1
    @Ben I added a comment to your answer. I'm really not sure what your point is. We can not infer the existence of $p$ from convergence alone, so I'm not sure how you're trying to get around the assumption. If I ask a real analysis professor to justify why the real number line is complete, they could rightly say it's an assumption/axiom. That may or may not be satisfying, but it is not wrong. –  Feb 09 '22 at 17:58
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/134009/discussion-between-ben-and-user37344). – Ben Feb 10 '22 at 05:59
3

As long as your coin is memoryless (each flip independent),* such a limiting probability exists because randomly chosen variation tends to cancel itself out. Mathematically, this fact is represented in the theorem that independent variances add: $$\mathrm{Var}(A+B)=\mathrm{Var}(A)+\mathrm{Var}(B)$$ Since the typical variation in $A$ (or $B$) is proportional to the square root (a concave function) of the variance, the variation in a sum grows more slowly than the sum itself.

The average takes a sum of random variables and then divides by the number of random variables, which grows in proportion to the sum itself. So the random variation in that average will be small.

We can formalize that argument as follows:

Remember that if a random variable $Y$ has mean $\mu$ and variance $\sigma^2$, then Chebyshev's inequality tells us $$\mathbb{P}[{|Y-\mu|>\delta}]\leq\frac{\sigma^2}{\delta^2}\tag{1}$$ Now, given $N$ coin flips $\{X_n\}_{n=1}^N$ (heads is $1$, tails $0$), the sample proportion of heads is the random variable $$Y_N=\frac{1}{N}\sum_{n=1}^N{X_n}$$

Now, we only ever measure $Y_N$ up to some accuracy. For example, have you ever really checked more than the 6th or 7th decimal place of a number? So take $\delta=10^{-8}$ in (1). The idea is to show that $\mathrm{Var}{(Y_N)}\to0$; then (1) will tell us that $$\mathbb{P}\left[{\left|Y_N-p\right|>10^{-8}}\right]\to0$$ where $p=\mathbb{E}[X_1]$ is the probability of heads on a single flip, and so any variation in the average will be indistinguishable to us.

So, OK, let's compute that variance. Since each coin flip is independent, $$\mathrm{Var}{(Y_N)}=\frac{1}{N^2}\sum_{n=1}^N{\mathrm{Var}(X_n)}=\frac{1}{N^2}\cdot N\mathrm{Var}(X_1)=\frac{\mathrm{Var}(X_1)}{N}\to0$$ as $N\to\infty$.
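
As a quick numerical check (a sketch assuming a fair coin, so $\mathrm{Var}(X_1)=1/4$), the variance of the average shrinks like $1/(4N)$:

#Sketch: Var(Y_N) shrinks like Var(X_1)/N (fair coin assumed, illustrative only)
set.seed(4)
var.of.average <- function(N, reps = 10^4) {
  var(replicate(reps, mean(rbinom(N, size = 1, prob = 0.5)))) }
sapply(c(10, 100, 1000), var.of.average)   #Roughly 1/(4*N) in each case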

(One can generalize this argument into a proof of Kolmogorov's Strong Law of Large Numbers.)


* The claim also holds for exchangeable random variables, as in Ben's answer. But the proof I know (page 185) uses a martingale convergence argument, which is complicated to explain if you haven't seen it.

-3

The set of principles that applies to coin tossing from which we can derive that, are indeed axioms. That means they’re open to question only inasmuch as the whole idea.

I see a hugely important secondary question as whether this is about prediction, measurement or explanation.

Whichever matters most, how is this not a sophisticated version of the standard high-school query: does a single toss have the same odds as one in a sequence, and then, which sequence?

After User7344, Ben asks why there should be a single probability that applies to all coin tosses. Rather, how could there not be if “all coin tosses” are equal?

How is it not axiomatic, in- or outside chaos theory and the nature of randomness that from a choice of two outcomes with all things equal, the likelihood of either is 1/2?

At risk of being pedantic, the actual wording of the Question negates much of its value. That looks like a linguistic niggle, but consider the detailed discussion that’s developed from it! I hope we all saw the same meaning, but how could I prove that? In reality, there can be neither frequency nor convergence in “a coin toss”; only in a series of tosses. Take that same ambiguity back from linguistics to probability, and what certainty remains?

In my view “what we empirically observe” is a starting point, often to be explained but rarely needful of justification.

One thing I suggest most people won’t accept until they’ve tried it, is that whether convergence is what we “empirically observe” depends on how patient we are.

I built a simple roulette simulator and until someone shows how they don’t, I suggest red/black spins follow the same probabilities as heads/tails tosses.

Will you take the time to guess how long an uninterrupted string of either outcome was not uncommon?

I thought that might be four or five, perhaps six, but in fact after millions of runs, no string shorter than 14 turned out to be uncommon. I’m told that both the result, and my amazement at it, are frequently seen in studies of advanced maths.
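
For what it’s worth, a minimal R sketch (my own, assuming fair red/black spins and ignoring the zero pocket, so not the same as the simulator I built) shows similarly long runs:

#Sketch: longest single-colour run in a million fair spins (illustrative only)
set.seed(5)
spins <- sample(c('R', 'B'), 10^6, replace = TRUE)
max(rle(spins)$lengths)   #Typically around 20, so runs in the teens are routine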

If any sequence is equally likely and all are combined, how could the results not converge? How would that not describe the idea of an “average”?

If any sequence is equally likely and they are not combined, how could 13-in-a-row not skew the empirical view of whichever observer saw it?

  • 2
    Could you explicitly indicate which part(s) of this post, if any, address the question that was asked? – whuber Feb 13 '22 at 19:11
  • @whuber I read the Question as "Why should the frequency of heads in a coin toss converge (to anything at all)" Didn't you? The exposition emphasised, "is there some simple set of principles/axioms that apply to coin tossing from which we can derive this fact?" Are you saying that with or without that, this is not about axioms? What in "the set of principles that applies to coin tossing from which we can derive that, are indeed axioms. That means they’re open to question only inasmuch as the whole idea" didn't work for you? Could you at all indicate what you didn't like? – Robbie Goodwin Feb 13 '22 at 19:46
  • I still cannot find anything in this post that answers the question. Could you indicate what the principles and axioms are that you name and how they apply? – whuber Feb 13 '22 at 19:49
  • @Whuber I name no principles because axioms come first. Don't they? The axiom which applies here is that given a random choice of two outcomes, the probability of either is 1/2. How does that not work for you? – Robbie Goodwin Feb 13 '22 at 19:59
  • That sounds like Laplace's Rule of Insufficient Reason. But neither the nature of randomness nor the actual probabilities of the coin are in question. I do not see how it illuminates the question of convergence of frequencies. – whuber Feb 13 '22 at 21:24
  • @Whuber I Posted what I did because it seemed clear both the nature of randomness and the actual probabilities were in question. If not, what was the point of the Question? D'you see convergence of frequencies as some other-worldly abstract, or doesn't it not only result from, but pretty-much define the nature of randomness and probability. That definition being by observation, not prescription, changes what? – Robbie Goodwin Feb 15 '22 at 21:58