
Let's assume the tossing of an unfair coin is modeled by a random variable $X$ taking the values head and tail. You know that the objective probability of the coin coming up heads is either $p=0.4$ or $p=0.6$ and that no other value is possible. Further, you have reason to believe that $p=0.6$ is very likely to be true.

Following the Bayesian methodology, you could express your belief by the (subjective) prior distribution

$$\begin{align} P(p=0.6)&=0.9 \\ P(p=0.4)&=0.1\text{,} \end{align}$$

hence treating $p$ as a random variable. Once you do this, marginalization in combination with Lewis's Principal Principle implies that your subjective probability for the coin coming up heads should be

$$P(X=\text{head})=0.6 \times 0.9+0.4 \times 0.1=0.58\text{,}$$

which is rather inappropriate even as a subjective probability, because after all you know that $p$ can only take the value $0.4$ or $0.6$, and hence your subjective probability should be either $0.4$ or $0.6$.
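In code, this marginalization is just a prior-weighted average; a minimal sketch (Python, purely illustrative):

```python
# Prior over the two possible values of p, as stated above.
prior = {0.6: 0.9, 0.4: 0.1}

# Law of total probability: P(X = head) = sum over p of P(X = head | p) * P(p).
p_head = sum(p * weight for p, weight in prior.items())
print(round(p_head, 2))  # 0.58
```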

I would like to know how Bayesians respond to this issue. Has it been discussed theoretically, and what is the Bayesian position on it?


@Arya McCarthy: Thank you for your answer. I think you are right that Bayesians interpret $P(X=\text{head})=0.58$ as a degree of belief that heads will show up. My question is why this makes sense, and according to which criterion? I feel this boils down to the justification of the prior.

Imagine a Bayesian has the same prior distribution for other coins as well, and that the objective probability of getting a coin with $p=0.6$ is actually $0.9$ while the objective probability of getting a coin with $p=0.4$ is actually $0.1$. If, in an experiment, the Bayesian selects a series of coins at random and tosses them, $P(X=\text{head})=0.58$ would be justified, because in this two-stage random experiment (first select a coin, then toss it) the overall relative frequency of heads will, roughly speaking, converge to the expected value, which is $0.58$. Hence, roughly speaking, a long-run strategy based on $P(X=\text{head})=0.58$ would actually be successful in the real world, and this seems to me a good criterion for $P(X=\text{head})=0.58$ to make sense: actual long-run success in similar situations.

So how do you arrive at a prior that roughly reflects these two objective probabilities? You could collect a sample and calculate frequentist estimates. If this is not possible, you should somehow try to guess what the two objective probabilities in the two-stage experiment really are. Only then can you interpret $P(X=\text{head})=0.58$ as your subjective belief about the long-run frequency in a two-stage experiment and argue that you believe this leads to the best long-run strategy.

That being said, in my initial example, if I had to make a single bet, I personally would base my action on $p=0.6$, because I am not interested in the long run here. This does not mean $P(X=\text{head})=0.58$ is wrong: it reflects the long-run relative frequency, provided my prior also does.
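To illustrate the long-run argument above, here is a minimal simulation sketch of the two-stage experiment (first select a coin, then toss it), using the probabilities assumed in this edit:

```python
import random

random.seed(1)  # for reproducibility

n = 1_000_000
heads = 0
for _ in range(n):
    # Stage 1: draw a coin; p = 0.6 with probability 0.9, else p = 0.4.
    p = 0.6 if random.random() < 0.9 else 0.4
    # Stage 2: toss the selected coin once.
    heads += random.random() < p

print(heads / n)  # relative frequency of heads; should be close to 0.58
```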

Jan
  • Could you explain what you're trying to ask by "what sense does this make?" – whuber May 13 '21 at 18:34
  • I updated the question. It is a valid theoretical issue and I hope I made that clear. – Jan May 13 '21 at 19:33
  • I see no problem with 0.58 as a subjective probability; it is simply taking into account uncertainty over the two possible choices. – bdeonovic May 13 '21 at 20:13
  • If you were in a casino, would you base your next action on P(X=head)=0.58 or on p=0.6 (the maximum of the prior), given that you know that p can be either 0.4 or 0.6? If the answer is p=0.6, then what practical meaning does P(X=head)=0.58 have? – Jan May 13 '21 at 20:52
  • You'd base it on the first one. (A lot of Bayesian reasoning actually was first described in the context of betting odds.) – Arya McCarthy May 13 '21 at 23:48
  • Don't forget—you're not betting on the value of $p$. You're betting on the outcome of the sampled coin flip. – Arya McCarthy May 14 '21 at 00:14
  • @AryaMcCarthy could you please elaborate on that? – Dave May 14 '21 at 00:14
  • It might be less confusing to use a less loaded letter than $p$ for the coin's parameter. I'll call it $f$. The asker seems concerned that the true $f$ cannot be 0.58. But the Bayesian never claims that the coin's parameter $f$ is 0.58. They're saying that this is their belief about $p(\text{heads})$. – Arya McCarthy May 14 '21 at 00:48
  • It may also be useful to consider more extreme values for $f$ than 0.4 and 0.6. Consider $P(f=0.99) = 0.9$ and $P(f=0.01) = 0.1$. Behaving as if the probability of heads is 0.99 ignores the 10% chance that it's the much smaller 0.01, but the Bayesian estimate of $P(X=\text{head}) = 0.892$ considers both possibilities in proportion to their prior probability. – jkpate May 14 '21 at 14:32
  • Editing questions is not meant for discussion with other users. Please do not use it like this. A few edits to improve questions are fine, but not frequent edits. – Tim May 18 '21 at 06:50
  • @Jan I would drop your edit and ask the question again. The reason is that in your edit, you drop marginalization in the second part when you discuss what you would do and instead impose a loss function. Implicitly, you have imposed the "all or nothing" loss function. Although you are not aware of it, that decision has vast mathematical implications for gambling. Arya's answer is correct for your first question. You cannot get from Arya's answer to the first half, to your question in the second half. Also, as you have told us what you would do, it isn't really a question anymore. – Dave Harris May 19 '21 at 19:28
  • [Here](https://stats.stackexchange.com/questions/539351/how-do-bayesians-interpret-px-x-theta-c) is a thread addressing a related question about coin tosses and Bayesian predictive probability. [Here](https://stats.stackexchange.com/a/538616/307000) is a thread comparing a Bayesian predictive probability to a frequentist predictive p-value using a coin toss example. – Geoffrey Johnson Aug 10 '21 at 21:21

3 Answers

  • What you refer to as $P(X=\text{head})$ is in fact the expected value $E(p)$ of the prior distribution that you've chosen for $p$. You correctly noticed that the distribution can take only two values, $0.4$ and $0.6$, so $0.58$ is an impossible result for such a distribution. There is nothing wrong with this. A Bernoulli distribution can take only two values, $0$ and $1$, yet its expected value is a real number in the unit interval. The expected value of a discrete distribution, like the binomial or Poisson, can be a real number even though those are distributions for integer-valued quantities. This has nothing to do with Bayesian statistics; it's how random variables and expected values work.

  • It’s worth pointing out that the expected value of the prior distribution for $p$ is not the “subjective probability”; it’s just a numerical summary of the prior distribution that you assumed for the parameter $p$. This summary may or may not make sense as a description of the distribution. Your prior is the whole distribution, not its expected value.
    If your prior probability distribution is

    $$ \begin{align} P(p=0.6)&=0.9 \\ P(p=0.4)&=0.1\text{,} \end{align} $$

    and based on this you need to make a single bet, then you could simply bet on $0.6$ as it has the higher probability. There is no reason why you would bet on the impossible $0.58$ average. The expected value just tells you what $p$ is on average, nothing more than this.

  • This choice of the prior for $p$ is rather strange, but I understand that this is a made-up example. As you noticed, the distribution can take only two values. If you plug the prior into Bayes' theorem

    $$ P(A|B) = \frac{P(B|A)\,P(A)}{P(B)} $$

    the result would also need to be defined in terms of those two values. The more usual choice of a prior for a probability $p$ would be a distribution over the whole range of possible values, e.g. uniform over $(0, 1)$, or a beta distribution. Such distributions assign non-zero probabilities to all possible values of $p$, so you could update the prior with the data (through the likelihood) and come up with an estimate (see the sketch after this list). Usually you have no way of knowing for certain that the parameter is a particular real number, so such a prior both makes more sense and avoids the problems you describe.

  • If you collect data, then the more data you have, the closer the result will be to what you observed in the data. Given that you used a reasonable prior (one that assigns non-zero probabilities to the possible values of the parameter), this methodology enables you to estimate the distribution of possible values of the parameter given the data and the prior.

  • What is probability? Bruno de Finetti once said that "probability does not exist". Probability is an abstract, mathematical concept. There are no "probabilities" in the wild. If you toss a coin, an unlimited number of different factors make the outcome hard to predict. Each of those factors alone can be described in terms of laws of physics, but together they form a chaotic system. We assign a probability to simplify all those factors to a single number, but there is no "probability" that is responsible for the result of the coin toss.

    Bayesians reason in terms of subjective probabilities: a probability is a number between zero and one that measures how much I believe in something. If I needed to make a bet, this number would help me quantify the amount of the bet that would be reasonable given my assumptions.
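To make the beta-prior route mentioned in the list above concrete, here is a minimal conjugate-updating sketch; the prior parameters and the observed counts are made up purely for illustration:

```python
# Beta(a, b) prior over p; Beta(1, 1) is the uniform distribution on (0, 1).
a, b = 1.0, 1.0

# Hypothetical data: outcomes of 100 tosses.
heads, tails = 37, 63

# The beta prior is conjugate to the Bernoulli likelihood, so the
# posterior is again a beta distribution with updated parameters.
a_post, b_post = a + heads, b + tails

posterior_mean = a_post / (a_post + b_post)
print(round(posterior_mean, 3))  # (1 + 37) / (2 + 100) ≈ 0.373
```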

Tim

Expanding my comment into an answer: I don't see this as inappropriate at all. To call it inappropriate is analogous to saying that the expected value of a die roll being 3.5 is inappropriate—after all, a 3.5 can never come up!

It might be less confusing to use a less loaded letter than $p$ for the coin's parameter. I'll call it $f$.

You seem concerned that the true $f$ cannot be 0.58. But the Bayesian never claims that the coin's parameter $f$ is 0.58. They're saying that this is their belief about $p(\text{heads})$. The belief incorporates uncertainty about $f$.

Don't forget—you're not betting on the value of $f$. You're betting on the outcome of the sampled coin flip. Believing that $p(\text{heads})$ is 0.58 lets you hedge your bets. In fact, de Finetti explained the concept of subjective probabilities in the context of betting odds. It's exactly what you're considering here: rather than treating $p(\text{heads})$ as 0.6 or 0.4, you use your uncertainty about $f$ to alter how confidently you'd bet on heads.
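One way to make this betting reading concrete: a subjective probability of $0.58$ fixes the odds at which you would be indifferent to a bet on heads. A small sketch with illustrative numbers:

```python
def expected_profit(p_heads, odds, stake=1.0):
    """Expected profit of staking `stake` on heads at the given odds:
    win `odds * stake` on heads, lose the stake on tails."""
    return p_heads * odds * stake - (1 - p_heads) * stake

# Break-even odds are (1 - 0.58) / 0.58 ≈ 0.724: negative expected
# profit below them, roughly zero at them, positive above.
for odds in (0.5, 0.724, 1.0):
    print(odds, round(expected_profit(0.58, odds), 4))
```

If you instead acted on $p=0.6$ alone, you would accept any odds above $(1-0.6)/0.6 \approx 0.667$, including some (e.g. odds of $0.7$) that the $10\%$ chance of the $f=0.4$ coin makes unprofitable in expectation.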

Arya McCarthy
  • Thank you for your answer. I think you are right that Bayesians interpret $P(X=\text{head})=0.58$ as a degree of belief that heads will show up. See my longer post above. – Jan May 16 '21 at 09:09

This is not really an objection to Bayesian analysis at all --- it is an objection to the rules of probability theory. If you have an event $\mathcal{E}$ (e.g., a head occurring on a coin) and a discrete parameter $p$ giving the conditional probability of the event, then the law of total probability states that:

$$\mathbb{P}(\mathcal{E}) = \sum_p \mathbb{P}(\mathcal{E}|p) \, \pi(p) = \sum_p p \cdot \pi(p).$$

As can be seen from this equation, the marginal probability of the event is a convex combination of the conditional probabilities given by the possible values of the parameter $p$. Your question asserts that the marginal probability of the event should be equal to one of the possible parameter values, but that is not always true. In your example, there is nothing wrong with the conclusion that the marginal probability of a head is $0.58$ --- that is the value that follows from the stipulated prior for $p$ and the likelihood function for the outcome of the coin.

As to a response, Bayesians would have the same response as any other practitioners who use probability theory as the basis for statistics. If you are unwilling to accept the law of total probability then you are rejecting one or more of the axioms of probability, and the onus would then be on you to stipulate an alternative theory of probability to take its place.
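A quick numerical restatement of the convex-combination point: as the prior weight on $p=0.6$ varies, the marginal probability sweeps out every value between $0.4$ and $0.6$, and it need not equal either conditional probability. A sketch:

```python
# Marginal P(heads) for several prior weights w on p = 0.6
# (the question stipulates w = 0.9).
for w in (0.0, 0.25, 0.5, 0.9, 1.0):
    marginal = 0.6 * w + 0.4 * (1 - w)
    print(w, round(marginal, 3))  # always in [0.4, 0.6]
```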

Ben
  • Thank you for your answer. It is only Bayesians who treat the fixed constant $p$ as a random variable (frequentists do not). It is only this treatment that makes it possible to apply the law of total probability to the situation. A frequentist would not agree with the equation you presented. I am not questioning the axioms of probability here, only the treatment of fixed constants as random variables. – Jan May 15 '21 at 17:20
  • @Jan but, as correctly noticed by Ben, there is nothing "Bayesian" about your example. It is just a probability distribution, and you are calculating its expected value, which falls outside the possible values of the random variable. The same thing applies to the Bernoulli distribution, so you could argue that if a frequentist estimates the mean of coin tosses to be 0.513, the number does not make sense. – Tim May 15 '21 at 19:58
  • @Jan: As Tim notes, this is something that also happens in non-Bayesian applications. Many statistical paradigms use the law of total probability, so if you are unwilling to accept values computed from this rule, that is something that will affect more than just Bayesian analysis. – Ben May 16 '21 at 09:22
  • As I said, it is only Bayesians who treat the fixed constant $p$ as a random variable, i.e. as a function $p: \Omega \rightarrow [0,1]$, and only then does $p=0.6$ represent an event, i.e. a subset of $\Omega=\{head,tail\}$, i.e. $\{p=0.6\}=p^{-1}(0.6)=\{\omega \in \Omega : p(\omega)=0.6\}$. The law of total probability applies only to events, i.e. subsets of $\Omega$ in this case, and not to constants. In the described single-stage experiment, no frequentist would agree with the application of the law of total probability. – Jan May 16 '21 at 10:40
  • @Jan this example has *nothing* to do with Bayesian statistics; it does not even involve Bayes' theorem. It uses basic laws of probability. This is how probability theory works. The average human has between one and two legs; this is how expected values work, notwithstanding that the result does not have much to do with human biology. – Tim May 16 '21 at 11:52
  • @Tim: It might help you to consider a *two*-stage experiment where we first choose from two *different* types of coins with objective probabilities $p_1=0.4$, $p_2=0.6$ and then toss: $\Omega:=\{p_1,p_2\}\times\{head,tail\}$, $X:\Omega\rightarrow\{head,tail\}$, $p:\Omega\rightarrow [0,1], \omega \mapsto \pi_1(\omega)$ ($\pi_1$ is the projection onto the first component), $\mathbb{P}(X=head|p=p_i):=p_i$, $i=1,2$, $\mathbb{P}(p=p_1):=0.1$, $\mathbb{P}(p=p_2):=0.9$. Here $p$ models the selection of the coin. Then $\mathbb{P}(X=head)=\sum_i\mathbb{P}(X=head|p=p_i)\mathbb{P}(p=p_i)=0.58$. – Jan May 16 '21 at 12:06
  • @Tim: We disagree. – Jan May 16 '21 at 12:07
  • @Tim: I am not questioning the interpretation of the expected value as a mean. It is OK that the average human has 1.99 legs. However, the interpretation as a mean does not justify the *exact* identification of the probability of heads with the expected value. Bayesians say $P(X=\text{head})=0.58$ (exactly). Frequentists say that the mean of given data is an *approximation* of $P(X=\text{head})$. – Jan May 16 '21 at 12:17
  • @Jan I made edits in my answer to comment on that. – Tim May 16 '21 at 12:32