8

I have seen the post Bayesian vs frequentist interpretations of probability and others like it, but they do not address the question I am posing. These other posts provide interpretations related to prior and posterior probabilities, $\pi(\theta)$ and $\pi(\theta|\boldsymbol{x})$, not $P(X=x|\theta=c)$. I am not interested in the likelihood as a function of the parameter and the observed data; I am interested in the interpretation of the probability distribution of unrealized data points.

For example, let $X_1,...,X_n\sim Bernoulli(\theta)$ be the result of $n$ coin tosses and $\theta\sim Beta(a,b)$ so that $\pi(\theta|\boldsymbol{x})$ is the pdf of a $Beta(a+\sum x,b + n - \sum x)$.
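
For concreteness, here is a minimal Python sketch of this conjugate update (the hyperparameters, seed, and data-generating value are arbitrary illustrations, not part of the question):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

a, b = 2.0, 3.0                          # arbitrary Beta prior hyperparameters
theta_sim = 0.6                          # value used only to generate illustrative tosses
n = 50
x = rng.binomial(1, theta_sim, size=n)   # n Bernoulli tosses

# Conjugate posterior: Beta(a + sum(x), b + n - sum(x))
posterior = stats.beta(a + x.sum(), b + n - x.sum())
print(posterior.mean(), posterior.interval(0.95))  # point and interval summary
```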

How do Bayesians interpret $\theta=c$? $\theta$ of course is treated as an unrealized or unobservable realization of a random variable, but that still does not define or interpret the probability of heads. $\pi(\theta)$ is typically considered as the prior belief of the experimenter regarding $\theta$, but what is $\theta=c$? That is, how do we interpret a single value in the support of $\pi(\theta)$? Is it a long-run probability? Is it a belief? How does this influence our interpretation of the prior and posterior?

For instance, if $\theta=c$ and equivalently $P(X=1|\theta=c)=c$ is my belief that the coin will land heads, then $\pi(\theta)$ is my belief about my belief, and in some sense so too is the prior predictive distribution $P(X=1)=\int\theta\pi(\theta)d\theta=\frac{a}{a+b}$. To say "if $\theta=c$ is known" is to say that I know my own beliefs. To say "if $\theta$ is unknown" is to say I only have a belief about my beliefs. How do we justify interpreting beliefs about beliefs as applicable to the coin under investigation?
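
The prior predictive value $P(X=1)=\frac{a}{a+b}$ quoted above is easy to check by Monte Carlo; a sketch with arbitrary hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 3.0                          # arbitrary hyperparameters

theta = rng.beta(a, b, size=1_000_000)   # draws from the prior
x = rng.binomial(1, theta)               # one toss per prior draw

print(x.mean(), a / (a + b))             # both close to 0.4
```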

If $\theta=c$ and equivalently $P(X=1|\theta=c)=c$ is the unknown fixed true long-run probability for the coin under investigation: How do we justify blending two interpretations of probability in Bayes theorem as if they are equivalent? How does Bayes theorem not imply there is only one type of probability? How are we able to apply posterior probability statements to the unknown fixed true $\theta=c$ under investigation?

The answer must address these specific questions. While references are much appreciated, the answers to these questions must be provided. I have provided four Options or proposals in my own solution below as an answer, describing the challenges of interpreting $P(X=x|\theta=c)$ as a belief or as a long-run frequency. Please identify which Option in my answer most closely maps to your answer, and provide suggestions for improving my answer.

I am not writing $P(X=x|\theta=c)$ to be contemptuous. I am writing it to be explicit since $P(X=x|Y=y)$ is not the same thing as $P(X=x|Y)$. One might instead be inclined to write in terms of a sample from the prior and use an index of realizations of $\theta$. However, I do not want to present this in terms of a finite sample from the prior.

More generally, how do Bayesians interpret $P(X=x|\theta=c)$ or $P(X\le x|\theta=c)$ for any probability model and does this interpretation pose any challenges when interpreting $P(\theta=s|\boldsymbol{x})$ or $P(\theta\le s|\boldsymbol{x})$?

I've seen a few other posts tackle questions about Bayesian posterior probability, but the solutions aren't very satisfying and usually only consider a superficial interpretation, e.g. coherent representations of information.

Related threads:

Examples of Bayesian and frequentist approaches giving different results

Bayesian vs frequentist interpretations of probability

UPDATE: I received several answers. It appears that a belief interpretation for $P(X=x|\theta=c)$ is the most appropriate under the Bayesian paradigm, with $\theta$ as the limiting proportion of heads (which is not a probability) and $\pi(\theta)$ representing belief about $\theta$. I have amended Option 1 in my answer to accurately reflect two different belief interpretations for $P(X=x|\theta=c)$. I have also suggested how Bayes theorem can produce reasonable point and interval estimates for $\theta$ despite these shortcomings regarding interpretation.

Geoffrey Johnson
  • 2,460
  • 3
  • 12
  • You have that $X\sim Bern(\theta)$. The probability of success $\theta$ can take values inside the interval $[0,1]$. So, you might believe that, "my friend is really lucky and he always tosses heads", so you want to express that $\theta$ will take large values, i.e. $0.7-0.9$. A natural way to do that is to let $\theta$ be a random variable, and now because it is a random variable you can assign it a distribution concentrated at $0.7-0.9$, i.e. $\theta\sim \pi(\theta)=Beta(\theta;1,0.25)$. All this logical procedure happened before observing any data, that's why $\pi(\theta)$ is your prior belief – Fiodor1234 Aug 06 '21 at 15:13
  • Thanks, [@Fiodor1234](https://stats.stackexchange.com/users/208406/fiodor1234)! So $\theta$ can take values in [0,1] and you provide some examples, 0.7 and 0.9. How then would I interpret the value 0.7 or 0.9, the probability of heads? What does 0.7 represent? From your description it sounds like 0.7 is the proportion of heads in the limit as the number of tosses tends to infinity. – Geoffrey Johnson Aug 06 '21 at 15:21
  • 3
    What does the new title mean?? The notation makes little sense and is ungrammatical to boot. – whuber Aug 06 '21 at 16:53
  • @whuber, under the Bayesian paradigm $\theta$ is considered a random variable that has a support. $\theta^*$ is one such value in the support. It is equivalent to writing $P(X=x|Y=y)$. If you prefer I use something other than $\theta^*$ to refer to a value in the support for $\theta$ please advise. – Geoffrey Johnson Aug 06 '21 at 17:02
  • Thanks @Tim! I edited my question to make it clear that these other posts interpret prior and posterior probabilities, not data probabilities. My question concerns the interpretation of data probabilities and whether this poses a challenge when interpreting posterior probabilities. If you have any suggestions for making this even clearer please let me know. – Geoffrey Johnson Aug 06 '21 at 17:32
  • 3
    I think you’re misreading the other threads. They discuss how Bayesians interpret probability, any probability. There’s no distinction between data probability and other probability. – Tim Aug 06 '21 at 18:01
  • Hi @Tim. The other threads emphasize interpretation of posterior probability as the belief of the experimenter or knowledge or plausibility of a hypothesis concerning a parameter. I have provided an answer to my own question below and I seem to run into trouble if I interpret $P(X=x|\theta=c)$ in the same way or as a long-run probability. The other questions and answers in the other threads do not address this challenge. If they do and I am somehow missing it please direct me to a specific answer that addresses my question. – Geoffrey Johnson Aug 06 '21 at 18:14
  • 4
    There are not many different probabilities. It would be logically awkward to assume that you could multiply "long run" probability by subjective probability and get something meaningful. As other threads discuss, Bayesians interpret probability as "degree of belief". If you feel that other threads don't discuss that, edit your question to make it explicit that you ask only about likelihood and if it differs from prior or posterior in interpretation. – Tim Aug 06 '21 at 18:26
  • Hi @Tim, my question did not get a fair start and has a -1 vote from someone who misinterpreted my initial question. This will discourage other users from even viewing it. Is it possible to start the vote over at zero? – Geoffrey Johnson Aug 06 '21 at 19:08
  • 7
    Users are free to cast whatever votes they want and nobody has power of reversing them. – Tim Aug 06 '21 at 19:09
  • @Tim, OK. So how do I get a fresh start now that the question is more clearly written and understood by the moderators? The down vote was for the earlier version that some people (including the downvoter) may not have understood or appreciated. Should I delete my question and re-post? – Geoffrey Johnson Aug 06 '21 at 19:14
  • 2
    If people find the question to be clearer now, then they can upvote it. I think you’re reading a bit too much into it to say they’ll read it less if it’s downvoted. As an extreme analogy, it’s like saying people will look at cars less if they’ve been in an accident. – Arya McCarthy Aug 06 '21 at 22:19
  • @GeoffreyJohnson *"if $P(X=1|θ=c)=c$ is my belief that the coin will land heads$"* that is only the case if $\pi(\theta)$ is a delta function centered on $c$. In general a Bayesian's belief that the coin will land heads is given by marginalizing over the nuisance parameter, $\theta$, so for the two to agree, the prior would have to be that delta function. The question seems to be predicated on a misunderstanding of Bayesian statistics. – Dikran Marsupial Aug 25 '21 at 10:36
  • @Dikran Marsupial, see Option 2 and Option 3 in the answer I provided. – Geoffrey Johnson Aug 25 '21 at 17:36
  • @GeoffreyJohnson I have seen them, that does not address the misunderstanding in your question, "P(X=1|θ=c)=c" is essentially meaningless, as others have pointed out. – Dikran Marsupial Aug 25 '21 at 17:40
  • You have stated in your own comments and answer that $P(X=1|\theta=c)=c$ is a long-run probability. You have even referred to this as a "true" probability. If $P(X=1|\theta=c)=c$ is meaningless, then $\pi(\theta|\boldsymbol{x})$ is also meaningless. – Geoffrey Johnson Aug 25 '21 at 17:45
  • Is this question fundamentally about why/how the belief based Bayesian interpretations (subjective probabilities) can/must quantitatively correspond to the results of the frequentist interpretation (at least under conditions where there is a fequentist interpretation). – Dave Aug 25 '21 at 18:07
  • Hi Dave, I did not intend the question to be about how or why Bayesian results correspond to frequentist results, though I address this in my own answer below. The question is about about interpreting the data distribution in conjunction with the prior and posterior distributions. – Geoffrey Johnson Aug 25 '21 at 18:11
  • @GeoffreyJohnson "You have stated in your own comments and answer that P(X=1|θ=c)=c is a long-run probability." **That is a misrepresentation of what I wrote**. What I wrote was that c might be the true (physical) probability that frequentists equate with long run frequencies ( "https://stats.stackexchange.com/questions/539351/how-do-bayesians-interpret-px-x-theta-c-and-does-this-pose-a-challenge-whe/541308?noredirect=1#comment994173_539352" ) . – Dikran Marsupial Aug 26 '21 at 06:59
  • " P(X=1|θ=c)=c is meaningless, then π(θ|x) is also meaningless." no, because "P(X=1|θ=c)=c " is not part of the computation of π(θ|x), which doesn't involve knowing the true (physical) probability. If we knew $c$, no inference would be necessary, either for the Bayesian or the frequentist. – Dikran Marsupial Aug 26 '21 at 07:03
  • If you were asking about the interpretation of $P(X=x|\theta)$, that would be straightforward. It is the setting $\theta = c$ that is causing the problem, because you clearly misunderstand what that means if you think it means our belief in flipping the coin giving a head, because it doesn't. – Dikran Marsupial Aug 26 '21 at 07:08
  • According to Bayesians, $\theta$ is a random variable and so too is $P(X=x|\theta)$. I am asking you to interpret a single value in the support of $\theta$, namely $\theta=c$. – Geoffrey Johnson Aug 26 '21 at 13:34
  • @geoffreyJohnson, I would view $\theta$ as a parameter of a model. Saying $\theta = c$ is just setting the parameter of the model to its true value. I don't see what more interpretation is required than that? I'm beginning to think that coin flipping is perhaps not the best example, as $P(X=1|\theta) = \theta$, which is not the case for most models. – Dikran Marsupial Aug 26 '21 at 13:45
  • We can talk about $\theta=c$ being the unknown fixed true limiting proportion of heads as the number of flips tends to infinity for the coin under investigation. When Bayesians use Monte Carlo simulation they sample from the prior. We can also talk about a single realization from the prior, say $\theta=d$. I'm asking for an interpretation of a value in the support of $\theta$. – Geoffrey Johnson Aug 26 '21 at 13:52
  • 5
    @GeoffreyJohnson, sorry again you are forcing my answer into a framework of long run frequencies and in the Bayesian case, doing so by introducing sampling. I've given you the interpretation. It is a model parameter; $\pi(\theta)$ gives the prior relative plausibility of particular values for that model parameter. The data are not generated by a model, they are generated by physics. – Dikran Marsupial Aug 26 '21 at 13:59

11 Answers

13

I have posted a related (but broader) question and answer here which may shed some more light on this matter, giving the full context of the model setup for a Bayesian IID model.

You can find a good primer on the Bayesian interpretation of these types of models in Bernardo and Smith (1994), and you can find a more detailed discussion of these particular interpretive issues in O'Neill (2009). A starting point for the operational meaning of the parameter $\theta$ is obtained from the strong law of large numbers, which in this context says that:

$$\mathbb{P} \Bigg( \lim_{n \rightarrow \infty} \frac{1}{n} \sum_{i=1}^n X_i = \theta \Bigg) = 1.$$

This gets us part-way to a full interpretation of the parameter, since it shows almost sure equivalence with the Cesàro limit of the observable sequence. Unfortunately, the Cesàro limit in this probability statement does not always exist (though it exists almost surely within the IID model). Consequently, using the approach set out in O'Neill (2009), you can consider $\theta$ to be the Banach limit of the sequence $X_1,X_2,X_3,...$, which always exists and is equivalent to the Cesàro limit when the latter exists. So, we have the following useful parameter interpretation as an operationally defined function of the observable sequence.
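
As a numerical illustration of the limit statement above (a sketch only; the particular $\theta$ is arbitrary), the running proportion of heads settles on $\theta$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 0.3                                   # arbitrary illustrative limit value
x = rng.binomial(1, theta, size=100_000)

running_mean = np.cumsum(x) / np.arange(1, x.size + 1)
print(running_mean[[99, 9_999, 99_999]])      # approaches 0.3 as n grows
```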

Definition: The parameter $\theta$ is the Banach limit of the sequence $\mathbf{X} = (X_1,X_2,X_3,...)$.

(Alternative definitions that define the parameter by reference to an underlying sigma-field can also be used; these are essentially just different ways to do the same thing.) This interpretation means that the parameter is a function of the observable sequence, so once that sequence is given the parameter is fixed. Consequently, it is not accurate to say that $\theta$ is "unrealised" --- if the sequence is well-defined then $\theta$ must have a value, albeit one that is unobserved (unless we observe the whole sequence). The sampling probability of interest is then given by the representation theorem of de Finetti.

Representation theorem (adaptation of de Finetti): If $\mathbf{X}$ is an exchangeable sequence of binary values (and with $\theta$ defined as above), it follows that the elements of $\mathbf{X}|\theta$ are independent with sampling distribution $X_i|\theta \sim \text{IID Bern}(\theta)$ so that for all $k \in \mathbb{N}$ we have: $$\mathbb{P}(\mathbf{X}_k=\mathbf{x}_k | \theta = c) = \prod_{i=1}^k c^{x_i} (1-c)^{1-x_i}.$$ This particular version of the theorem is adapted from O'Neill (2009), which is itself a minor re-framing of de Finetti's famous representation theorem.

Now, within this IID model, the specific probability $\mathbb{P}(X_i=1|\theta=c) = c$ is just the sampling probability of a positive outcome for the value $X_i$. This represents the probability of a single positive indicator conditional on the Banach limit of the sequence of indicator random variables being equal to $c$.
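
A small simulation (my own sketch, not taken from O'Neill 2009) shows the two-stage sampling that the representation theorem describes: conditionally on $\theta$ the elements are IID, and marginally the sequence is exchangeable, e.g. $\mathbb{P}(X_1=1,X_2=0)=\mathbb{P}(X_1=0,X_2=1)$:

```python
import numpy as np

rng = np.random.default_rng(3)
a, b = 2.0, 3.0                 # arbitrary Beta prior hyperparameters
m = 1_000_000                   # number of simulated sequences

theta = rng.beta(a, b, size=m)  # one theta per sequence
x1 = rng.binomial(1, theta)     # first two elements, IID Bern(theta) given theta
x2 = rng.binomial(1, theta)

p_10 = np.mean((x1 == 1) & (x2 == 0))   # marginal P(X1=1, X2=0)
p_01 = np.mean((x1 == 0) & (x2 == 1))   # marginal P(X1=0, X2=1)
print(p_10, p_01)                       # approximately equal, as exchangeability requires
```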

Since this is an area of interest to you, I strongly recommend you read O'Neill (2009) to see the broader approach used here and how it is contrasted with the frequentist approach. That paper asks some similar questions to what you are asking here, so I think it might assist you in understanding how these things can be framed in an operational manner within the Bayesian paradigm.

How do we justify blending two interpretations of probability in Bayes theorem as if they are equivalent?

I presume here that you are referring to the fact that there are certain limiting correspondences analogous to the "frequentist interpretation" of probability at play in this situation. Bayesians generally take an epistemic interpretation of the meaning of probability (what Bernardo and Smith call the "subjective interpretation"). Consequently, all probability statements are interpreted as beliefs about uncertainty on the part of the analyst. Nevertheless, Bayesians also accept that the law-of-large-numbers (LLN) is valid and applies to their models under appropriate conditions, so it may be the case that the epistemic probability of an event is equivalent to the limiting frequency of a sequence.

In the present case, the definition of the parameter $\theta$ is the Banach limit of the sequence of observable values, so it necessarily corresponds to a limiting frequency. Probability statements about $\theta$ are therefore also probability statements about a limiting frequency for the observable sequence of values. There is no contradiction in this.

Ben
  • 91,027
  • 3
  • 150
  • 376
  • Thank you! I will be certain to read the references. Am I understanding correctly that the Bayesian interpretation for probability statements of $X$ given $\theta=c$ is a long-run probability interpretation? Am I also understanding correctly that the Bayesian sees no problem blending two different interpretations of probability in Bayes theorem (one for $X$ the other for $\theta$) as if they are equivalent or compatible? – Geoffrey Johnson Aug 11 '21 at 11:24
  • I have added another second interpretation under Option 1 in my own answer. Is this how I should interpret the posterior if probability statements about the data have a long-run probability interpretation? – Geoffrey Johnson Aug 11 '21 at 17:00
  • 2
    The parameter $\theta$ is giving the long-run proportion of positive indictors, so it corresponds to the long-run probability interpretation. Bayesians usually adopt an epistemic interpretation of probability (see e.g., Bernardo and Smith, though they call it the "subjective" interpretation), but we also accept the laws-of-large-numbers so we accept that there is often a correspondence between probability and long-run frequency of an event. Contrary to your own answer, this is not really an "optional" interpretation --- once you have a model where LLN applies, it applies. – Ben Aug 11 '21 at 23:00
  • Hi Ben, thanks for your help. I don't mean option as "optional" just a proposal using my intuition. – Geoffrey Johnson Aug 12 '21 at 00:01
  • 1
    I think you are missing a lot of context here, which makes it hard. I recommend looking at the post linked in this question and also reading the cited material. This will give you much more context on how Bayesians set up an IID model from foundational assumptions. – Ben Aug 12 '21 at 00:12
  • In your limit is $\theta$ fixed? Are the $X_i$'s coming from $f_{\theta}$ or from the prior predictive distribution? – Geoffrey Johnson Aug 12 '21 at 01:26
  • Hi Ben, I have amended my question. I'm hoping you can address the specific points in bold in your answer. – Geoffrey Johnson Aug 12 '21 at 02:08
  • 3
    @GeoffreyJohnson: I think you are getting ahead of yourself here. I recommend reading the linked material and getting a feel for the setup of the Bayesian IID model before asking additional questions. Some of your questions are unclear, largely because they are divorced from the proper context of these models. You might find that the linked works assist in understanding some of the setup of these models, which may assist in framing further questions. – Ben Aug 12 '21 at 03:50
6

IMO you can find equally hard-to-answer philosophical questions like this about the foundations of any branch of maths, not just Bayesian statistics: what does it really mean, deep down, to say that 1+1=2? Not sure I could answer that satisfactorily, but I still confidently use arithmetic.

But my interpretation is: we don't need to think of $\theta$ as a long-run probability. It's the number with the property that if my utility function is linear, then I should be indifferent to the opportunity to pay $\theta$ for a contract which pays out $1$ if the next flip of the coin is heads and 0 if it is tails.
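
As a toy check of that operational definition (a sketch; the numbers are arbitrary), the expected net payout of the contract at price $\theta$ is zero under linear utility:

```python
theta = 0.6                     # hypothetical degree of belief in heads
price = theta                   # indifference price for the contract
payout_heads, payout_tails = 1.0, 0.0

expected_net = theta * payout_heads + (1 - theta) * payout_tails - price
print(expected_net)             # 0.0: indifferent between buying and not buying
```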

fblundun
  • 3,732
  • 1
  • 5
  • 18
  • But how can you then be sure that your probability $\theta$ is well-calibrated? – kjetil b halvorsen Aug 17 '21 at 15:08
  • @kjetilbhalvorsen I don't think it's generally possible to prove that a prior was sensible. (Though there is some accountability in that if you disagree with my prior, we can make a bet which is favourable to me according to my prior and to you according to yours.) – fblundun Aug 18 '21 at 13:29
5

More generally, how do Bayesians interpret P(X=x|θ=c) or P(X≤x|θ=c) for any probability model and does this interpretation pose any challenges when interpreting P(θ=s|x) or P(θ≤s|x)?

$P(X=x\vert \theta=c)$ is the degree of belief ascribed to the outcome $X=x$ conditioned on the fact that $\theta =c$ under the model represented by $P$.

$P(\theta =s \vert X=x)$ is the degree of belief ascribed to the outcome $\theta=s$ conditioned on the fact that $X=x$ under the model represented by $P$.

$P(\Theta = \theta)$ is the degree of belief ascribed to the outcome $\Theta=\theta$ under the model represented by $P$.

$P(\Theta)$ is the degree of belief ascribed to the outcome $\Theta=\theta$, for each $\theta \in \Theta$, under the model represented by $P$.

(and so on...)

"Degree of belief" can be operationalized in terms of betting/preference behavior (as mentioned in other answers). More abstractly, Bayesian probability theory is a formalization of aspects of how people reason under uncertainty (at least some of the time...), so it is, itself, a model for belief.

How do we justify blending two interpretations of probability in Bayes theorem as if they are equivalent? How does Bayes theorem not imply there is only one type of probability? How are we able to apply posterior probability statements to the unknown fixed true θ=c under investigation?

I believe the first two questions miss a key point: more than one category of phenomena can be modeled using the same mathematics. In this case both (an idealization of) belief and (an idealization of) repeatable trials can be represented by the same mathematical formulation: probability theory. Good/useful/accurate beliefs about repeatable trials will assign "degrees of belief" for an outcome $x$ that are equal to the proportion of occurrences for the outcome $x$ in the ensemble of trials (basically by Dutch Book arguments), so under those circumstances you get a numerical correspondence between these two different aspects of the world.

For the third question, these statements are "the degree of belief ascribed to the outcome...".

Dave
  • 3,109
  • 15
  • 23
  • Thank you! This appears to map to Option 2 in my answer. Would you agree? In that case the questions about justifying blending two interpretations are not applicable. These questions are only applicable if we interpret $\theta=c$ as an unknown fixed constant. How do we justify beliefs about beliefs as being applicable to the coin under investigation? – Geoffrey Johnson Aug 26 '21 at 19:55
  • After all, $P(X=1|\theta)=\theta$ or $P(X=1|\theta=c)=c$. – Geoffrey Johnson Aug 26 '21 at 20:02
  • @GeoffreyJohnson yes, but without the "belief about belief" stuff -- there are just beliefs about propositions, 'that X takes on the value x' etc., I'm not sure what you're getting at with "interpret $\theta=c$ as an unknown fixed constant". – Dave Aug 26 '21 at 20:03
  • Thanks again. I mean that if $P(X=1|\theta=c)=c$ is not a belief, but the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant. – Geoffrey Johnson Aug 26 '21 at 20:05
  • If the proposition $\theta=P(X=1|\theta)$ is a belief, how is the proposition $P(\theta \le s)=P( P(X=1|\theta) \le s)$ not a belief about a belief? I'm not trying to be difficult, I just want to know how to respond to such a question. – Geoffrey Johnson Aug 26 '21 at 20:10
  • 2
    @GeoffreyJohnson your question relies on the coincidental numerical correspondence between the value of parameter $\Theta$ and the degree of belief that the model ascribes to the outcome $X=1$ given that particular value for $\Theta$. Though $P(X=1 \vert \theta)$ and $\theta$ are both real numbers, they are in different spaces. This should be obvious for the problems where the analog of $\Theta$ is discrete. – Dave Aug 26 '21 at 20:21
  • @GeoffreyJohnson the domain of $P(\Theta)$ is $\Theta$. $P(\theta < s)$ is a short hand for "sum $P(\Theta=\theta)$ for all $\theta \in \Theta$ such that $\theta < s$". – Dave Aug 26 '21 at 20:25
  • Thank you. I have given your answer a +1 vote. – Geoffrey Johnson Aug 26 '21 at 21:39
  • Good answer indeed. Some Bayesians would prefer "state of knowledge" rather than "belief", and there is indeed a distinction. I tend to use both because both subjective Bayes and objective Bayes have their place (IMHO). The mechanics of the analysis are the same for both, but they don't mean quite the same thing. – Dikran Marsupial Aug 27 '21 at 07:08
  • @Dave, does your answer suggest that $\theta=c$ is the limiting proportion for the coin under investigation, and $P(X=1|\theta=c)$ is the belief about the coin landing heads given this limiting proportion? – Geoffrey Johnson Aug 28 '21 at 20:36
  • In his answer Ben refers to $P(X=1|\theta=c)$ as a sampling distribution, while in your answer it is referred to as a belief distribution. I see how they coincide, but this is using two different interpretations of probability and gets at the heart of my original question. Would it be more appropriate for Ben not to refer to this as a sampling distribution when operating under the Bayesian paradigm? – Geoffrey Johnson Aug 28 '21 at 20:48
  • I have amended my Option 2 to have two versions, a) and b). Option 2 a) does not have the "belief about belief" stuff, treating $\theta$ as the limiting proportion of heads (it is not the probability of heads). Option 2 b) treats $\theta$ as the probability of heads, which necessitates the "belief about belief" stuff. – Geoffrey Johnson Aug 28 '21 at 21:09
  • @GeoffreyJohnson P(X=x|θ=c) is the degree of belief ascribed to the outcome X=x conditioned on the fact that θ=c under the model represented by P. If it works out that $c$ is the frequency of heads in some population (or even some theoretical limit of such a frequency under some other model), that does not change the Bayesian interpretation of the conditional probability represented by the expression $P(X=1\vert \theta=c)$. – Dave Aug 30 '21 at 14:15
4

I'm going to go in an entirely different direction from the other replies; I hope this comes across as helpful rather than simply contrary.

I suggest that we don't need to interpret conditional probability (or any other probability) in general. In any actual application, the interpretation of the conditional probability will be forced upon us by the context (in fact, we may be presented with many ways to frame the problem, all with different meanings!).

But without a specific application, the question of interpretation is meaningless. The reason your example seems difficult to interpret is because it is critically underspecified - the problem does not give us any real-world scenario that we're trying to reason uncertainly about. It's not surprising, in such a situation, that it's hard to resolve what exactly we mean by uncertain knowledge.

user3716267
  • 614
  • 3
  • 11
  • Thank you! I had wondered about this direction myself. In the coin toss example I would say we are reasoning about the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation. Another application would be a patient population. Do conditional probability statements about the population represent the unknown true relative frequencies? My belief about these frequencies? The limiting proportions as the number of samples from this population tends to infinity? The last one is the only one that works well for both examples. – Geoffrey Johnson Aug 12 '21 at 16:41
  • If we leave it too nebulous I feel it starts to lose its meaning or purpose. – Geoffrey Johnson Aug 12 '21 at 16:42
  • 3
    A thing to keep in mind here is that the map is not the territory. The coin in our model does not really exist; no real coin behaves in such a simple manner. The "true relative frequency" is a model-relative fiction. – user3716267 Aug 13 '21 at 02:02
  • 1
    While I can appreciate this level of philosophizing, if we go that route there may be no point in talking about anything in statistics, Bayesian, frequentist, or otherwise. – Geoffrey Johnson Aug 13 '21 at 23:44
  • [Here](https://www.youtube.com/watch?v=UCmPmkHqHXk) is a fun youtube video that gives hope to the idea of a "true relative frequency." – Geoffrey Johnson Aug 13 '21 at 23:51
  • 1
    I disagree. There's plenty of point in talking about these things, inasmuch as models that involve them are useful. It's just pointless to seek much interpretation past that. The [reference class problem](https://en.wikipedia.org/wiki/Reference_class_problem) pretty much kills any hope for a single unambiguous interpretation of probability, anyway. – user3716267 Aug 14 '21 at 02:47
  • This is generally taken as the argument against Bayesian inference in favor of the p-value since there are any number of different priors one could consider, informative and non-informative, while the set of all repeated experiments forms a single reference class. – Geoffrey Johnson Aug 14 '21 at 16:17
  • 2
    In my experience, it's "Bayesians" who are generally eager to point out the reference class problem as a critique of frequentism. Both are mistaken; there's no getting away from the reference class problem in either direction. The frequentist scenario is as much a useful fiction as the Bayesian one (we never really repeat experiments perfectly in practice...). P-values are fraught for a whole bunch of unrelated reasons, and should be used approximately nowhere. – user3716267 Aug 14 '21 at 17:06
  • 1
    You might find this article interesting: https://link.springer.com/article/10.1007/s10670-017-9936-9 – user3716267 Aug 14 '21 at 17:09
  • 2
    Another aspect of this is that all real world problems, in fact, deal with finite populations. So some of these $N \rightarrow \infty$ limits are in fact, just abstractions. – Dave Aug 25 '21 at 18:12
4

I think that the issue here is whether the likelihood that is used to turn a Bayesian prior into a Bayesian posterior has a Bayesian interpretation, or whether it is necessarily a long-run frequency. IF that is the question, then I would say "yes", of course it does.

If we have a Bernoulli likelihood then $f(k;\theta) = \theta^k(1-\theta)^{1-k}$ for $k \in \{0,1\}$, where $\theta$ is a parameter, with unknown "true" probability $c$. Now I would regard $c$ as a "physical probability" (IIRC Good uses that term as well). If we are talking about macroscopic objects, like coins, then there is no such thing as "randomness". Whether a coin comes down heads or tails is entirely deterministic. The only reason we can't know whether it comes down heads or tails is that we don't have perfect knowledge of the initial conditions. So a "physical probability" is basically summarising the appearance of randomness that is caused by that lack of knowledge.

It is important to distinguish physical probabilities from either Bayesian probabilities or frequentist ones. It is distinct from frequentist probabilities because the long run frequency normally stems directly from the physical probability being (assumed to be) the same for all possible trials. However, they are not directly equivalent. The physical probability is directly the probability that a particular coin flip will come up heads, because it describes the physics that makes that so. A frequentist probability, as a long-run frequency, can't be applied to a particular coin flip (as it has no long run frequency, it happens only once). It can only make a probabilistic statement about a (fictitious) population of coin flips, of which this can be considered a particular sample.

Note that something that can only ever happen at most once can have a physical probability of happening, but it can't have a long run frequency.

A physical probability isn't a Bayesian probability either. Bayesian probabilities represent our beliefs (subjective or objective) about the plausibility of different values of that physical probability. The Bayesian is making a probabilistic model of the physics, it isn't the physics itself.

So for me, I would view $f(k;c)$ as neither Bayesian nor frequentist, or perhaps both, but a statement about the physics of the data-generating process. It is equally true for single observations, or when looking at a long run of observations.

So in this case, to get our posterior, we would say

$$p(\theta|X=1) = \frac{P(X=1|\theta)\,p(\theta)}{P(X=1)}$$

Note that $c$ does not appear anywhere in our inference to obtain our posterior belief, which is why I think the question as posed is essentially meaningless.

Of course we could look at our posterior to evaluate $p(c|X=1)$, but this would just give us the point probability of the model parameter, $\theta$, being the same numeric value as the true "physical" probability. As this is evaluating a specific point on a continuous distribution, it is just a density, not an actual probability. Probably not very useful.

At the end of the day, it is a very badly posed question, but that seems to be the most meaningful answer I can give to the most informative interpretation that is reasonably consistent with the question as posed.

Arya McCarthy
  • 6,390
  • 1
  • 16
  • 47
Dikran Marsupial
  • 46,962
  • 5
  • 121
  • 178
3

You play a coin flip game with your friend, but you know that your friend somehow tends to toss heads almost every time. So, you can say something like "Ha, I know my friend (prior belief), he always tosses heads, so the probability of him tossing heads again will be somewhere between $[0.7,0.9]$", i.e. $\theta$ can take any value inside that interval.

A natural way of saying, in probabilistic language, that the probability of success (for your friend) lies inside $[0.7,0.9]$ is to let $\theta$ be a random variable.

Now that $\theta$ is a random variable you can assign it a distribution. But your distribution has to reflect your prior belief, that the probability of success for your friend will be inside the interval $[0.7,0.9]$.

A good choice of distribution for $\theta$ would be a $Beta(a,b)$ distribution (as it takes values inside $[0,1]$, where probabilities also live).

However, this $Beta(a,b)$ distribution must put more mass on values inside $[0.7,0.9]$, which is your prior belief that your friend almost always tosses heads.

To do that you can center the distribution around $0.8$, which is the midpoint of the interval $[0.7,0.9]$.

You can do that by solving $\frac{a}{a+b}=0.8$; one potential solution is choosing $a=10$ and then $b=2.5$.

So, $\pi(\theta)= Beta(\theta;10,2.5)$ reflects your prior belief that the success probability of your friend lies inside the interval $[0.7,0.9]$.
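
A quick check of this choice (a sketch using scipy; any similar Beta would do) shows the mean is $0.8$ and a substantial share of the prior mass sits in $[0.7,0.9]$:

```python
from scipy import stats

prior = stats.beta(10, 2.5)
mass = prior.cdf(0.9) - prior.cdf(0.7)   # prior probability of [0.7, 0.9]
print(prior.mean(), mass)                # mean 0.8; sizeable mass in the interval
```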

Now if you want to say something about the limit as $n$ (the number of samples) tends to infinity, then check that the mean of the posterior is

$$\text{Mean} = \frac{a+\sum x}{a + b + n} = \frac{a}{a+b+n} + \frac{\sum x}{a+b+n}$$

where for $n\rightarrow \infty$, the mean of your posterior belief $\pi(\theta|x)$ goes to $\bar{x}$.
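
A numerical illustration of this limit (a sketch; the hyperparameters and true proportion are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
a, b = 10.0, 2.5
theta_sim = 0.5                      # value used only to generate tosses

for n in (10, 1_000, 100_000):
    x = rng.binomial(1, theta_sim, size=n)
    post_mean = (a + x.sum()) / (a + b + n)
    print(n, post_mean, x.mean())    # posterior mean approaches the sample mean
```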

Fiodor1234
  • 1,679
  • 6
  • 15
  • Thank you! It seems you have taken "probability of heads" for granted and jumped to "the prior probability (plausibility) of the probability of heads." You indicated that [0.7, 0.9] is a range of plausible values for $\theta$, the probability of heads. If $\theta$ is in fact 0.82, how do we interpret that number? Does this interpretation pose any challenges when it comes time to interpret the posterior probability (plausibility) of the probability of heads? – Geoffrey Johnson Aug 06 '21 at 16:03
  • 1
    I took it for granted because Bernoulli distribution is defined that way, i.e. $P(X=Heads)=\theta$. If you work on the simulated cases where you know the true value of $\theta$, and let's say that the true value is $\theta=0.82$, then what you can say about the posterior probability is if it under or overestimate the true value $\theta = 0.82$ – Fiodor1234 Aug 06 '21 at 16:10
  • Thanks! So how do we interpret $\theta=0.82$? Does this pose any challenges when interpreting $P(\theta \le r | \boldsymbol{x})$? – Geoffrey Johnson Aug 06 '21 at 16:14
  • 1
    No, it doesn't pose any challenge, because you have a whole posterior distribution $\theta|x$ that you can use to do any kind of inference that you want. You can calculate the mean and compare it with the true value $\theta = 0.82$ you can easily calculate quantiles, for example the $\mathbb{P}(\theta \leq 0.82|x)$. You can do whatever you want pretty easily, I hope that helps :) – Fiodor1234 Aug 06 '21 at 16:20
  • See my proposed interpretations (posted as an answer) and let me know which one is correct, or if you can provide another. – Geoffrey Johnson Aug 06 '21 at 16:41
  • I think the second option is the one that is "almost" correct. But $\pi(\theta)$ is your belief about $\theta$. And the prior predictive distribution is your data distribution averaged over all $\theta$ based on your prior belief $\pi(\theta)$, i.e. $p(X=1)=\int p(X=1,\theta) d\theta = \int p(X=1|\theta)\pi(\theta)d\theta = \int \theta \pi(\theta)d\theta$ – Fiodor1234 Aug 06 '21 at 16:49
  • 3
    If you re-post same question it will be closed as duplicate. Re-posting is considered as spamming. – Tim Aug 06 '21 at 19:19
2

I think part of the problem is that there are some notational problems in the question, and a degree of people talking past each other due to having different backgrounds/positions, so I'll go through the question trying to understand what was meant. I will be happy to be corrected if I am wrong and will edit the answer until we understand each other.

The first issue is what does the author mean by $P(X=x|\theta=c)$? I think this is intended to mean the probability that the random variable $X$ has the value $x$ if the parameter of the model, $\theta$ has its "true" value, $c$.

How do Bayesians interpret θ=c, the probability of heads? θ of course is an unrealized or unobservable realization of a random variable,

This is a problematic line for me as $\theta$ is not a random variable, but a parameter of the model. If we knew what $c$ was, we would just set $\theta = c$ and there would be no need for a prior or a posterior. But we don't know the optimal value of the parameter, so what do we do?

The traditional Bayesian approach is to construct a prior for the unknown parameter value, $\pi(\theta)$, that represents what we know about the parameter a-priori (which may be very little). If we want to know what values of $\theta$ are plausible, given our prior and our data point, $X = x$, then we use Bayes rule, giving

$p(\theta|X = x) = \frac{P(X = x|\theta)p(\theta)}{P(X=x)}$

Notice I have written $P(X=x|\theta)$ rather than $P(X = x|\theta = c)$. This is because we are not interested in a single number telling us the probability of a head. We want to continue representing our knowledge in the form of a distribution of relative plausibilities of all possible values of $\theta$. Representing knowledge in the form of distributions, rather than point values, is fairly central to Bayesianism.

IF we wanted to give a single number representing the probability of a head, then we might take the mode of $P(\theta|X=x)$ or the expectation of $\theta$ with respect to $P(\theta|X=x)$. But asking how Bayesians interpret $\theta = c$ seems meaningless, it is just setting a parameter of our model to a particular value.

For instance, if P(X=1|θ=c)=c is my belief that the coin will land heads, then π(θ) is my belief about my belief, and in some sense so too is the prior predictive distribution P(X=1)=∫θπ(θ)dθ=a/(a+b). To say "if θ=c is known" is to say that I know my own beliefs. To say "if θ is unknown" is to say I only have a belief about my beliefs.

This seems very confused. In the case of flipping a coin (a Bernoulli trial), then $P(X=1|\theta=c) = c$ is a tautology as the parameter of a Bernoulli distribution is the probability that $X=1$, so this equation only holds when the parameter of the distribution is equal to its true value. But we don't know the value of $c$, so Bayesians wouldn't encounter this. $\theta$ is a parameter of a model, $c$ is its true value, what more could there be?

$P(X=1|θ=c)=c$ is not my belief that the coin will land heads, it is the true probability that it will land heads. It can't be my belief as it relies on me knowing the correct value of the parameter $\theta$, but I don't. This means that "then π(θ) is my belief about my belief," is incorrect, because the premise was incorrect. It is just your belief about the relative plausibilities of different values of the parameter $\theta$.

To say "if θ=c is known" is to say that I know my own beliefs.

No, this would be equivalent to saying that you know the true value of the parameter $\theta$, so it is just saying the prior should be a delta function centered on $c$. It is just a direct statement of your prior belief/state of knowledge.

To say "if θ is unknown" is to say I only have a belief about my beliefs.

Again, this is incorrect because the premise at the start of the paragraph was false. It just means you don't know the true value of parameter $\theta$ so perhaps a flat prior distribution on the interval 0 to 1 would be appropriate (encoded as a Beta distribution for convenience).

I think I'll leave it at that for now, adding more is likely to just be further talking past each other, so I will wait for @GeoffreyJohnson 's comments/corrections.

Dikran Marsupial
  • 46,962
  • 5
  • 121
  • 178
  • your suggestion maps to Option 3 in the answer I provided. – Geoffrey Johnson Aug 25 '21 at 17:37
  • @GeoffreyJohnson You have not addressed the errors in the question that I have pointed out. It is premature to interpret my position until you have addressed those errors. – Dikran Marsupial Aug 25 '21 at 17:39
  • I have addressed all of your concerns in Options 1 through 4 in my answer. There is not enough room to restate this in a comment. – Geoffrey Johnson Aug 25 '21 at 17:42
  • 1
    No. You claim that P(X=1|θ=c)=c is my belief that the coin will land heads, which simply is not correct. You have not addressed that point. For a Bayesian P(X=1) is my belief that the coin will land heads, which I estimate by marginalising over $\theta$, not by setting it to some particular value. – Dikran Marsupial Aug 25 '21 at 17:45
  • That was one possible interpretation. I did not say it was the correct interpretation, nor the only interpretation. Please read my answer. – Geoffrey Johnson Aug 25 '21 at 17:47
  • 1
    O.K. you are not going to engage with my request to clarify the notation and avoid talking past each other. I was willing to give it a try, it is your choice. – Dikran Marsupial Aug 25 '21 at 17:50
  • Please make a specific request and I will consider it. I see you have talked about my question and my proposed interpretations, but I haven't found any specific requests. – Geoffrey Johnson Aug 25 '21 at 18:02
  • 1
    I already have done, this is the comment section on the answer where I did exactly that. I am going to leave the discussion at this point as it is clear you are not willing to listen. The answer begins "I think part of the problem is that there are some notational problems in the question, and a degree of people talking past each other due to having different backgrounds/positions", which you clearly haven't read. – Dikran Marsupial Aug 25 '21 at 18:09
  • I see you do not like $P(X=1|\theta=c)=c$, but I do not see a request. What notation would you prefer I use for this quantity? – Geoffrey Johnson Aug 25 '21 at 18:12
  • 1
    I have to know what *you* mean by it to know how I would express it. As I said, it is a tautology for a Bernoulli trial - it means nothing AFAICS. "but I do not see a request. " sorry that doesn't wash - read the first paragraph. – Dikran Marsupial Aug 25 '21 at 18:13
  • As I point out in my answer we have two options: i) the limiting proportion of heads, or ii) a belief. In Option 1 and Option 3, $P(X=1|\theta=c)=c$ is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant. As I point out in Option 2, the limiting proportion of heads is not a probability. $P(X=1|\theta=c)=c$ is my belief about the coin landing heads in any given throw, and I set this belief equal to the long-run proportion. The question then becomes, what is the impact when interpreting the posterior? – Geoffrey Johnson Aug 25 '21 at 18:20
  • 3
    I explained in my answer why $P(X=1|θ=c)=c$ does not represent my belief about the coin landing heads. A Bayesian's beliefs are encoded in distributions, not point values. In this case, a distribution over $\theta$, which induces a distribution on anything that depends on $\theta$. If we wanted to make a point estimate, we might take the most likely value of $\theta$, but that does not represent our full beliefs about the coin. – Dikran Marsupial Aug 26 '21 at 06:42
  • 1
    " I set this belief equal to the long-run proportion" why would you do that if you are interested in Bayesian interpretations? Doing that is ignoring the prior entirely, so there is no impact of that when interpreting the posterior, simply because that is not how Bayesians form the posterior. – Dikran Marsupial Aug 26 '21 at 06:44
  • Ben insisted in his answer that in Bayesian statistics there is only one interpretation of probability - belief. If that is true then Option 2 is the correct interpretation. This does not ignore the prior as the prior is clearly in Option 2. – Geoffrey Johnson Aug 26 '21 at 13:24
  • 1
    I am not Ben. I am trying to find out what *you* mean by your notation and have explained what I think is wrong with it. My comment on Ben's answer was basically agreeing that the notation was wrong "At present this defines θ∗ (in the top section) in a self-referential way, which does not seem to be a helpful definition at all. –" I don't see the point in continually referring me to your answer, when that answer also contains the flawed notation. – Dikran Marsupial Aug 26 '21 at 13:36
  • 1
    If P(X=1|θ=c)=c does not ignore the prior, exactly how does the prior affect/interact with this formula? – Dikran Marsupial Aug 26 '21 at 13:38
  • $\theta^*$ is a specific value, interchangeable with $c$. I have used $\theta^*$ in Option 1 to make it clear that the prior is a collection of $\theta$'s and this particular value was sampled from among the other $\theta$'s. The coin under investigation was sampled from a bag of coins. – Geoffrey Johnson Aug 26 '21 at 13:46
  • The prior interacts with this formula through Bayes theorem. I'm not saying I prefer Option 2, I'm saying that if there is only one interpretation of probability, that is belief, then this is the only possible interpretation for the data distribution and for the prior and posterior. – Geoffrey Johnson Aug 26 '21 at 13:46
  • 1
  • Why do you not just write $P(X = 1|\theta^\ast)$? I don't understand why you don't just ask how we interpret the likelihood. For myself I tend to think of it in generative terms, it is the likelihood of the model generating a head with the parameter set to that value. The equation as you wrote it can't interact with the prior as you have set $\theta=c$. – Dikran Marsupial Aug 26 '21 at 13:49
  • Perhaps you want me to index values in the support of $\theta$. Draw a $\theta_j$, generate an $x$. Draw another $\theta_{j^{'}}$, generate an $x$. Each $\theta_j$ has a numerical value. I'm asking you to interpret this value, not the collection of $\theta_j$'s that form the prior distribution. – Geoffrey Johnson Aug 26 '21 at 13:54
  • 1
    No, I have no idea why you should think that. $\theta$ is a model parameter. $\theta^\ast$ is a particular value that model parameter could take. That is the interpretation. – Dikran Marsupial Aug 26 '21 at 14:01
  • I agree. I am asking what the particular value means. What does it represent? – Geoffrey Johnson Aug 26 '21 at 14:02
  • 2
    It is a probability - it can be of any sort. But in that case, what is the difference between your question and "how can probabilities be interpreted"? – Dikran Marsupial Aug 26 '21 at 14:03
  • This gets at the heart of my question since we seem to be flip flopping between stating: i) $\theta$ is fixed and has a value, ii) $\theta$ is random and it is meaningless to discuss its value, iii) nothing is random, it's all physics, iv) there are three different probabilities - physical, frequentist, and subjective. I was hoping for a single, simple, clear interpretation of all the values in the model. – Geoffrey Johnson Aug 26 '21 at 14:04
  • 1
    " θ is fixed and has a value" I am not flip-flopping. I have said no such thing, except as a possible interpretation of your peculiar notation. – Dikran Marsupial Aug 26 '21 at 14:05
  • Depending on who I ask and what question I ask, I get different answers. If I ask a Bayesian in general to define probability I am told that probability is belief, full stop. This maps to Option 2. If I ask about a cancer sceening test and positive predictive value, I get an interpretation like Option 1 (a). If I ask about a coin toss I get an interpretation like Option 3 where there are two or more different types of probability that are somehow compatible in Bayes theorem. – Geoffrey Johnson Aug 26 '21 at 14:07
  • 1
    Sorry the first three [ (i), (ii) and (iii) ] of those are blatant misrepresentations of what I have written. – Dikran Marsupial Aug 26 '21 at 14:08
  • 3
    "If I ask a Bayesian in general to define probability I am told that probability is belief, full stop." I don't think Jaynes would agree somehow. "This maps to Option 2" your answer has four up-votes and three down-votes, which suggests it has substantial problems, so it probably is not a particularly good idea to view all discussions through the prism of that answer. – Dikran Marsupial Aug 26 '21 at 14:17
  • Thank you for your help. Dave has provided a nice answer that sheds light on the matter. Perhaps you intended the same message and I did not catch it. – Geoffrey Johnson Aug 27 '21 at 14:48
  • Dave's answer (+1) seems to be answering much the question I was suggesting here https://stats.stackexchange.com/questions/539351/how-do-bayesians-interpret-px-x-theta-c-and-does-this-pose-a-challenge-whe/541308?noredirect=1#comment994523_541308 and that it was the problem with the notation (especially the $= c$ bit at the end). – Dikran Marsupial Aug 27 '21 at 16:26
2

Let us think about what you are attempting to ask. If we define $X=x\in\chi$, $\chi$ being the sample space, as observed, then in Bayesian thought, it is a constant. It is an observable. Instead of using the language of parameters and data, we can think in terms of observables and unobservables. There is no randomness here.

$\theta$ is normally a random variable in the parameter space, but it is now a constant. It is crucial that we know how it became so. It appears from the language of your posting that we are conditioning on it in the likelihood function so that, for our purposes, the likelihood is now $0.82^1$. So it is not a random variable either. So it is senseless to talk about a probability when everything is a constant. It would be like discussing the probability that $2+2=4$.

It isn't impossible to discuss this, but it is difficult for several reasons. First, the interpretation can change depending on the axioms used to derive Bayes rule. For example, if we are conditioning on $\theta=0.82$, then de Finetti's axioms would require a prior with mass only on $0.82$. What if we were using some other axiomatization and the mass of the prior was zero at the point we conditioned it? Cox's axioms would find that problematic as well. Savage's might not if we allowed for time inconsistency, though why you would change your mind on the prior and not the likelihood is beyond me.

We also need a better definition of what a constant is. For example, conditioning some parameters on constants is not that unusual in Bayesian thinking. Sometimes you do know one of them. There is another case, though, that wrenches up even the Frequentist toolset.

To give an example, the speed of light is known precisely. However, as distance is now normed against the speed of light, distance is uncertain. We used to measure the speed of light with uncertainty; we now measure distance with uncertainty.

Let us imagine we get out our carefully built scientific equipment and decide to measure out five kilometers for our morning run. Our equipment is accurate to within plus or minus twenty meters. When our device measures five kilometers, we know it is somewhere within 4980 and 5020 meters in reality. It is close enough. If this is part of our measuring, we could condition on it being five kilometers as it is close enough for our purposes. It is also definitely wrong. Because distance is a value in the real numbers, the probability that our actual distance is five kilometers when it registers five kilometers is a measure zero event. Our conditioning is wrong with certainty.

A second issue with this type of conditioning problem is a non-mathematical issue. If, instead, we were running a wrecking ball and hit our intended building, plus or minus twenty meters, we could be hitting the wrong building. At the same time, we have conditioned our uncertainty away. Had our wrecking ball been run by a robot, a la E.T. Jaynes, we would have no way to know our decision process was bad.

On the surface, you may think that would not matter, but de Finetti’s coherence wrecks that idea if we are gambling money. A bad constant could create a Dutch Book.

As I see it, there is no randomness in your problem. We observed the outcome; it is a certainty. We observed the parameter. It is being treated as a certainty. We are being bigoted in our conditioning in that we are saying there is no uncertainty.

What is probability in the face of perfect certainty? What do you mean by your question?

Dave Harris
  • 6,957
  • 13
  • 21
  • Thank you for your answer. Your next-to-last question is precisely my question. When we run $P(X=1|\theta)$ through Bayes Theorem we consider $P(X=1|\theta=c)$ for a host of different values of $\theta=c$ according to $\pi(\theta)$. From the other Dave's answer I am gathering that $\theta$ is the limiting proportion of heads for the coin under investigation (not a probability) so that $P(X=1|\theta=c)$ is my belief in heads if I know this limiting proportion. $\pi(\theta)$ is my belief in this limiting proportion (not my belief in the probability of heads). Option 2 a) in my answer. – Geoffrey Johnson Aug 29 '21 at 12:38
  • I'm not thinking of $P(X=1|\theta=c)$ as the likelihood based on an observed event. I am thinking of $P(X=1|\theta=c)$ as the probability of a future experimental outcome of heads given $\theta=c$. – Geoffrey Johnson Aug 29 '21 at 12:51
  • @GeoffreyJohnson then that is not a Bayesian question. Indeed, for it to be a Bayesian question, you would be interested in $\Pr(\tilde{x}|X=x)$ Your question is neither Frequentist nor Bayesian. It is classical statistics. Classical statistics concerned itself with questions like that. – Dave Harris Aug 29 '21 at 16:40
  • I think what you have written is the posterior predictive distribution. While this may also be of interest, it relies on $P(X=1|\theta)$. If $P(X=1|\theta)$ and $P(X=1|\theta=c)$ have no meaning, how do we ascribe a meaning to $Pr(\tilde{x}|X=x)$? To the frequentist, $P(X=1)$ is synonymous with $P(X=1|\theta)$ and represents the long-run proportion of heads, so my question is not outside the realm of frequentism. – Geoffrey Johnson Aug 30 '21 at 00:33
  • @GeoffreyJohnson I have written the posterior predictive distribution. I do not believe your proposition can have meaning in Bayesian thought because you leave no random variables. I would say that it is in the realm of Frequentism if you are building a test of $\theta=.82$ because you are testing a hypothesis. However, if you are conditioning on it, then you have moved over into classical statistics. You would be asserting it as true rather than testing if it is true. An element of this is that your notation is vague, not a good thing in mathematical notation. – Dave Harris Aug 30 '21 at 03:15
  • Thank you for your insights. – Geoffrey Johnson Aug 30 '21 at 03:21
0

$P(x=1|\theta=c)=c=1-P(x=0|\theta=c)$, and in the discretized model below, $P(x=1|\theta_i)=\theta_i$.

The $\theta$ is the aleatoric uncertainty inherited from the coin, and before we observe any events we can only guess which fraction is most plausible, for instance 0.5, but $\theta$ can take any value between 0 and 1. We assume it is discrete (for simplification) and can only take values from this list of 11 values: $\theta_i \in \{0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0\}$.


$P(\theta)$

In the continuous case it is a Beta distribution: the probability that the probability of heads is high or low, given the observations or our belief. It acts as a prior, meaning that after we have seen $X$ the posterior $P(\theta|X)$ becomes the prior for new observations.

Since we assume $\theta$ is discrete, every $\theta_i$ is assigned a probability $P(\theta_i)$, representing our weight on the hypothesis that the probability of heads is $P(x=1|\theta_i)=\theta_i$.

For a frequentist, however, it is a delta distribution. Say we see three heads in three tosses; then the point estimate puts $P(\theta=1)=1$ and 0 elsewhere. Its aleatoric uncertainty is 0. If we observe 3 more tosses, all tails, the delta distribution shifts to $P(\theta=0.5)=1$ and 0 elsewhere, and $P(\theta=1)$ becomes 0. Its aleatoric uncertainty is 0.5 now.
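A minimal sketch of this "delta distribution" on the 11-point grid, assuming the MLE $\hat{\theta}=k/N$ and mapping it to the nearest grid point:

```python
import numpy as np

theta = np.linspace(0, 1, 11)  # grid: 0.0, 0.1, ..., 1.0

def delta_distribution(k, N):
    """All probability mass at the grid point closest to the MLE k/N."""
    dist = np.zeros(len(theta))
    dist[np.argmin(np.abs(theta - k / N))] = 1.0
    return dist

print(delta_distribution(3, 3))  # mass at theta = 1.0 after 3 heads in 3 tosses
print(delta_distribution(3, 6))  # mass moves to theta = 0.5 after 3 more tails
```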


$P(x=1)=\sum_{i=1}^{11}P(x=1|\theta_i)P(\theta_i)$

Say we don't know $P(\theta)$, meaning we don't know any individual $P(\theta_i)$, so we set the probability of every probability to $1/11$. Then

$\begin{align*} P(x=1)&=\sum_{i=1}^{11}P(x=1|\theta_i)P(\theta_i)=\sum_{i=1}^{11}\theta_iP(\theta_i)\\&=\frac{1}{11}\sum_{i=1}^{11}\theta_i=\frac{0.0+0.1+\cdots+1.0}{11}\\&=\frac{5.5}{11}=0.5 \end{align*}$
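A minimal Python sketch of this prior predictive calculation on the grid, assuming the uniform prior above:

```python
import numpy as np

theta = np.linspace(0, 1, 11)   # grid: 0.0, 0.1, ..., 1.0
prior = np.full(11, 1 / 11)     # uniform prior: P(theta_i) = 1/11

# Prior predictive: P(x=1) = sum_i P(x=1 | theta_i) * P(theta_i)
p_heads = np.sum(theta * prior)
print(p_heads)                  # 0.5, up to floating point
```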


$P(\theta|X)$

Following the above, suppose we observe some data $X$. The posterior is a distribution, and for every $\theta_i$,

$\begin{align*} P(\theta_i|X)&=\frac{P(X|\theta_i)P(\theta_i)}{\sum_{j=1}^{11}P(X|\theta_j)P(\theta_j)}\\&=\frac{P(x=1|\theta_i)^k P(x=0|\theta_i)^{N-k}P(\theta_i)}{\sum_{j=1}^{11}P(x=1|\theta_j)^k P(x=0|\theta_j)^{N-k}P(\theta_j)}\\&= \frac{P(x=1|\theta_i)^k (1-P(x=1|\theta_i))^{N-k}P(\theta_i)}{\sum_{j=1}^{11}P(x=1|\theta_j)^k (1-P(x=1|\theta_j))^{N-k}P(\theta_j)} \end{align*}$

where $k$ is the number of heads observed and $N$ is the total number of tosses.

After we have obtained $P(\theta_i|X)$, say we observe some other data $Z$. To calculate $P(\theta_i|X,Z)$ we would treat $P(\theta)=P(\theta|X)$ as the new prior at each of the 11 values of $\theta_i$. With enough observations the influence of the initial prior $P(\theta_i)=1/11$ would vanish.

For $P(\theta\leq s|X)$ you just sum up all the $P(\theta_i|X)$ with $\theta_i$ less than or equal to $s$.
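A minimal sketch putting the whole grid update together, assuming the uniform initial prior and the Bernoulli likelihood above:

```python
import numpy as np

theta = np.linspace(0, 1, 11)   # grid: 0.0, 0.1, ..., 1.0
prior = np.full(11, 1 / 11)     # uniform initial prior

def grid_posterior(k, N, prior):
    """Posterior over the grid after observing k heads in N tosses."""
    likelihood = theta**k * (1 - theta)**(N - k)
    unnormalized = likelihood * prior
    return unnormalized / unnormalized.sum()  # divide by the evidence

post = grid_posterior(k=3, N=3, prior=prior)   # three heads in three tosses
print(post.round(3))                           # mass concentrates near theta = 1

# Sequential updating: yesterday's posterior is today's prior.
post2 = grid_posterior(k=0, N=3, prior=post)   # three more tosses, all tails

# P(theta <= s | X): sum the posterior mass at grid points <= s
s = 0.5
print(post2[theta <= s].sum())
```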

Lerner Zhang
  • Thank you Lerner Zhang for your help. I'm not sure I follow when you say, "For a frequentist, however, it is a delta distribution. Say we see three heads in three tosses; then the point estimate puts P(θ=1)=1 and 0 elsewhere. Its aleatoric uncertainty is 0." Is this to say that since the point estimate $\hat{\theta}=\bar{x}=1$ based on a sample of $n=3$, the coin will always land heads, i.e. $\theta=1$? Or are you suggesting something else, perhaps that among only those three throws the coin landed heads? Is $\theta$ then defined only for a finite observed sample? – Geoffrey Johnson Aug 12 '21 at 15:02
  • @GeoffreyJohnson I mean we predict that $\theta$ takes the value 1, out of the 11 grid values, with probability 1 and the other 10 values with probability 0. – Lerner Zhang Aug 12 '21 at 15:04
  • I would think the values for $\theta_i$ would match the support of the prior distribution, in this case the continuous support of the Beta distribution. Did you take discrete values to simplify your predictive probability calculation? – Geoffrey Johnson Aug 12 '21 at 15:09
  • It sounds like you agree with Ben that $\theta_i$ is a long-run probability. Can you address my concerns about blending two different probabilities in Bayes theorem? In the answer I provided, do you view Option 3 as correct? If not, can you provide the correct interpretation and justification for linking posterior probability statements to the unknown fixed true $\theta=c$? – Geoffrey Johnson Aug 12 '21 at 15:12
  • @GeoffreyJohnson Yes, for simplifying the illustration. – Lerner Zhang Aug 12 '21 at 15:14
  • @GeoffreyJohnson We initialize $\theta$ from the frequentist perspective, but as we see more observations the posterior accumulates evidence, and the prior would vanish given enough observations. – Lerner Zhang Aug 12 '21 at 15:18
0
It is a measure of $X$ being $x$ given $\theta=c$, in which I account for the stochastic variability of $\theta$ (which depends on my belief distribution) at the particular event $\theta=c$. That is, I measure the odds of $X=x$ assuming both $\theta=c$ and its probabilistic uncertainty. Furthermore, the mathematics will show that $P(X=x|\theta=c)$ depends on my belief about $\theta=c$, because the probability assigned to the event $\theta=c$ comes from my prior distribution for $\theta$, which is itself another belief, one that looks at $P(\theta=c)$; so $P(X=x|\theta=c)$ depends on this other belief. I mean that you are measuring the odds of $X=x$ taking into account your belief about $\theta=c$, although this was not made explicit.
Davi Américo
-5

Below are four different interpretations using the coin toss example that was provided in the original question. Option 1 a) appears to be the appropriate interpretation under the Bayesian paradigm. If you find one of these that maps to your answer, please identify it and offer suggestions for improvement if needed.

Option 1: Probability statements about $X$ and probability statements about $\theta$ are both statements of personal belief.

a) $\theta$ is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant with value $c$ and is not a probability. The only valid probability is my belief about any given flip, which I set equal to this unknown fixed constant. Therefore, $P\{X=1|\theta=c\}=c$ is my personal belief about the coin landing heads in any given throw if I know this limiting proportion. Since I do not know what to believe, $\pi(\theta)$ is my belief about the limiting proportion (not my belief about the probability of heads). If I were to integrate the data pmf using the prior distribution I would get the prior predictive distribution. Then $P\{X=1\}=\frac{a}{a+b}$ where $\frac{a}{a+b}$ is a "known" constant. In a different sense this would be my belief about the coin landing heads when not knowing the limiting proportion. The posterior is my belief about the limiting proportion given the observed data. Nevertheless, the prior and posterior probabilities do not represent factual statements about the limiting proportion of heads for the coin under investigation, nor are they statements about the experiment. Option 1 a) amounts to Option 2 b) since this is how the posterior is operationalized in practice using Monte Carlo simulations.

b) $c$ is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant and not a probability. The only valid probability is my belief about any given flip, which I set equal to this unknown fixed constant. Therefore $\theta=c$, and equivalently $P\{X=1|\theta=c\}=c$, is my personal belief about the coin landing heads in any given throw if I know this limiting proportion. Since I do not know what to believe, $\pi(\theta)$ is my belief… about my belief. If I were to integrate the data pmf using the prior distribution I would get the prior predictive distribution. Then $P\{X=1\}=\frac{a}{a+b}$ where $\frac{a}{a+b}$ is a "known" constant. In a different sense this would also be my belief about my belief. This option has us applying a belief probability measure to a belief probability measure. Similarly for $\pi(\theta|\boldsymbol{x})$ and $P\{X=1|\boldsymbol{x}\}$. However, because my belief was originally set equal to the limiting proportion, we can interpret prior and posterior probabilities as beliefs about the limiting proportion of heads for the coin under investigation. Nevertheless, these prior and posterior probabilities do not represent factual statements about the limiting proportion of heads for the coin under investigation, nor are they statements about the experiment. Option 1 b) amounts to Option 2 b) since this is how the posterior is operationalized in practice using Monte Carlo simulations.

Option 2: Probability statements about $X$ and probability statements about $\theta$ both have a frequentist interpretation.

a) $\theta=c$, and equivalently $P\{X=1|\theta=c\}=c$, is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant. The density $\pi(\theta)$ depicts a collection of $\theta$’s (coins) or the limiting proportions of randomly selected $\theta$’s (coins) as the number of draws from $\pi(\theta)$ tends to infinity. These probabilities are considered known constants. The unknown true $\theta=c$ under investigation was randomly selected from the known collection or prevalence of $\theta$’s according to $\pi(\theta)$, and the observed data is used to subset this collection forming the posterior. If we are to apply these posterior probability statements to make inference on the unknown true $\theta$ (coin) under investigation we have to change our sampling frame. We must imagine instead that the unknown true $\theta$ was instead randomly selected from the posterior. This, then, has cause and effect reversed since the posterior distribution, from which we selected $\theta$, depends on the data… but the data depended on the $\theta$ we had not yet selected from the posterior. We could imagine drawing a new $\theta$ (coin) from the posterior, but this would not be the same $\theta=c$ we started with under investigation. The challenge here is applying the probability statement in the posterior distribution to the unknown true $\theta$ (coin) under investigation in a meaningful way.

b) $\theta=c$, and equivalently $P\{X=1|\theta=c\}=c$, is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation, an unknown fixed constant. The density $\pi(\theta)$ depicts a collection of other $\theta$’s (coins) I have given to myself or the limiting proportions of randomly selected $\theta$’s (coins) as the number of draws from $\pi(\theta)$ tends to infinity. These probabilities are considered known constants. The observed data is used to subset this collection forming the posterior. The posterior is a legitimate sampling distribution of $\theta$'s (coins) I have given myself. The challenge here is applying the probability statements in the posterior distribution to the unknown true $\theta$ (coin) under investigation in a meaningful way since at no point was the true $\theta$ (coin) sampled from the posterior.

Option 3: Probability statements about $X$ have a frequentist interpretation and probability statements about $\theta$ represent personal belief.

$\theta=c$, and equivalently $P\{X=1|\theta=c\}=c$, is the limiting proportion of heads as the number of flips tends to infinity for the coin under investigation. Since I do not know this limiting proportion, $\pi(\theta)$ is my personal belief about this unknown fixed quantity. On the surface this seems the most reasonable. However, this would have Bayes theorem blending two different interpretations of probability as if they are compatible or equivalent, and it does not provide a clear link between posterior probability and the unknown fixed true $\theta$ under investigation. This would mean we are dealing with Option 1 or Option 2. Even if one insists on two different yet compatible interpretations of probability, Option 3 amounts to Option 2 b) since this is how the posterior is operationalized in practice using Monte Carlo simulations.

Option 4: Probability statements about $X$ have a frequentist interpretation and there are no probability statements about $\theta$.

$\theta=c$, and equivalently $P\{X=1|\theta=c\}=c$, is the limiting proportion of heads for the coin under investigation. The reason Bayesian statistics can provide reasonable point and interval estimates despite the shortcomings above regarding interpretation is that at the core of every prior is a likelihood. Something was witnessed or observed that gave rise to a likelihood, and therefore the prior. There are in fact no probability statements about $\theta$, whether belief, long-run, or otherwise. Bayes theorem amounts to multiplying independent likelihoods, equivalent to a fixed-effect meta-analysis, except that the Bayesian normalizes the joint likelihood instead of inverting a hypothesis test. If we view the Bayesian inference machine as a frequentist meta-analytic testing procedure the shortcomings above vanish. The posterior is an asymptotic confidence distribution. Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment.
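A minimal sketch of the Monte Carlo operationalization referenced in the Options above, assuming the conjugate Beta-Bernoulli model from the original question and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(0)

a, b = 1, 1                   # Beta(a, b) prior hyperparameters (assumed flat)
x = np.array([1, 0, 1, 1])    # hypothetical coin-toss data
n, heads = len(x), x.sum()

# Conjugate update: posterior is Beta(a + sum(x), b + n - sum(x))
draws = rng.beta(a + heads, b + n - heads, size=100_000)

# Posterior probability statements, e.g. P(theta <= 0.5 | x)
print((draws <= 0.5).mean())

# Posterior predictive P(X = 1 | x), the posterior expectation of theta
print(draws.mean())           # approx (a + heads) / (a + b + n)
```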

Geoffrey Johnson
  • At present this defines $\theta^*$ (in the top section) in a self-referential way, which does not seem to be a helpful *definition* at all. – Ben Aug 11 '21 at 23:08
  • Hi Ben, if none of my proposed interpretations is correct, can you provide one that is? – Geoffrey Johnson Aug 12 '21 at 00:03
  • See my answer above for a definition of the parameter $\theta$. Your values $\theta^*$ and $c$ appear to be just stipulated values for this parameter. – Ben Aug 12 '21 at 00:09
  • There is a parameter space, the unknown fixed true $\theta$ lives in that parameter space, and I am saying this unknown fixed true $\theta$ is equal to $c$, a particular numerical value. If I did not write a value $c$ it would look like I am conditioning on a random variable (from a Bayesian's perspective). It's not a stipulated value, it is the number that is $\theta$ in your limit above. – Geoffrey Johnson Aug 12 '21 at 00:34
  • If I am correct in understanding that $\theta$ in the limit you provided is an unknown constant $c$, does that mean my Option 3 is the preferred Bayesian interpretation? There are two interpretations for probability, $\theta$ is an unknown fixed constant, and it is acceptable to blend belief and long-run probability as if they are interchangeable? – Geoffrey Johnson Aug 12 '21 at 01:37
  • I agree with @Ben - $P\{X=1|θ=c\}=c$ is the **true** probability that $X=1$, i.e. the "physical" probability. The problem is that neither frequentists nor Bayesians know the true probability - frequentists equate this with long run frequencies, giving methods of estimation; Bayesians do so by updating their distribution reflecting the relative plausibilities of all possible values. – Dikran Marsupial Aug 25 '21 at 09:52
  • @Dikran Marsupial, your suggestion maps to Option 3 in the answer I provided. – Geoffrey Johnson Aug 25 '21 at 17:36
  • @GeoffreyJohnson again, you have not engaged with the point I have raised. "P{X=1|θ=c}=c" is not really meaningful in Bayesian analysis because we don't know $c$, so it doesn't really enter into the analysis. – Dikran Marsupial Aug 25 '21 at 17:43
  • It is what you are making inference on. If it is not meaningful then what is the analysis for? – Geoffrey Johnson Aug 25 '21 at 17:46
  • No, we make inference on $\theta$; we don't set it to some fixed value. As I said, $P(X=1|\theta=c) = c$ is just a tautology for a Bernoulli model - I've been trying to find a deeper meaning to it, based on what you have written, but your refusal to clarify isn't making it any easier to understand your position. Hence we talk past each other. – Dikran Marsupial Aug 25 '21 at 17:51
  • I'm not setting it equal to a value. I'm saying it has a value, an unknown value, and I am denoting this unknown value by $c$. If I did not identify a single value a Bayesian might mistake this for a random variable. – Geoffrey Johnson Aug 25 '21 at 18:05
  • @GeoffreyJohnson "I'm not setting it equal to a value", also GeoffreyJohnson "P(X=1|θ=c)=c is my belief about the coin landing heads in any given throw, and I set this belief equal to the long-run proportion." (see the discussion of my answer https://stats.stackexchange.com/questions/539351/how-do-bayesians-interpret-px-x-theta-c-and-does-this-pose-a-challenge-whe/541308?noredirect=1#comment994344_541308). For a Bernoulli trial $P(X=1|\theta) = \theta$, so setting that to $c$ is setting $\theta = c$. $\theta$ is explicitly set to $c$ in your notation. – Dikran Marsupial Aug 26 '21 at 06:49
  • You must read Options 1 through 3 independently. In Option 1 and Option 3, I am not setting anything equal. $\theta=c$ is the tautology you speak of. I am simply making it explicit. In Option 2 $c$ is the limiting proportion of heads, and I am setting my belief in heads, P(X=1|$\theta$), equal to this unknown value $c$. – Geoffrey Johnson Aug 26 '21 at 13:22
  • "In Option 2 c is the limiting proportion of heads, and I am setting my belief in heads, P(X=1|θ), equal to this unknown value c." Bayesians don't do that, as I have **repeatedly** pointed out. Sorry, I'm done, but you can't say I didn't try to help/understand. – Dikran Marsupial Aug 26 '21 at 14:28
  • I do appreciate your help. The Options I have provided are tentative proposals. Some of them may very well be blatantly wrong. My hope is that when someone provides an answer we can map it to one of my Options and improve that Option so it is correct. – Geoffrey Johnson Aug 26 '21 at 14:34