
It says on Wikipedia that:

the mathematics [of probability] is largely independent of any interpretation of probability.

Question: Then if we want to be mathematically correct, shouldn't we disallow any interpretation of probability? I.e., are both Bayesianism and frequentism mathematically incorrect?

I don't like philosophy, but I do like math, and I want to work exclusively within the framework of Kolmogorov's axioms. If this is my goal, should it follow from what it says on Wikipedia that I should reject both Bayesianism and frequentism? If the concepts are purely philosophical and not at all mathematical, then why do they appear in statistics in the first place?

Background/Context:
This blog post doesn't quite say the same thing, but it does argue that attempting to classify techniques as "Bayesian" or "frequentist" is counter-productive from a pragmatic perspective.

If the quote from Wikipedia is true, then it seems like from a philosophical perspective attempting to classify statistical methods is also counter-productive -- if a method is mathematically correct, then it is valid to use the method when the assumptions of the underlying mathematics hold, otherwise, if it is not mathematically correct or if the assumptions do not hold, then it is invalid to use it.

On the other hand, a lot of people seem to identify "Bayesian inference" with probability theory (i.e. Kolmogorov's axioms), although I'm not quite sure why. Some examples are Jaynes's treatise on Bayesian inference, "Probability Theory: The Logic of Science", as well as James Stone's book "Bayes' Rule". So if I took these claims at face value, that means I should prefer Bayesianism.

However, Casella and Berger's book seems frequentist, because it discusses maximum likelihood estimators but ignores maximum a posteriori estimators -- yet everything therein also seems mathematically correct.

So then wouldn't it follow that the only mathematically correct version of statistics is that which refuses to be anything but entirely agnostic with respect to Bayesianism and frequentism? If methods with both classifications are mathematically correct, then isn't it improper practice to prefer some over the others, because that would be prioritizing vague, ill-defined philosophy over precise, well-defined mathematics?

Summary: In short, I don't understand what the mathematical basis is for the Bayesian versus frequentist debate, and if there is no mathematical basis for the debate (which is what Wikipedia claims), I don't understand why it is tolerated at all in academic discourse.

innisfree
Chill2Macht
  • Perhaps also of interest: [Do Bayesians accept Kolmogorov axioms?](http://stats.stackexchange.com/q/126056/17230). – Scortchi - Reinstate Monica Aug 18 '16 at 08:42
  • Possible duplicate of *[Where did the frequentist-Bayesian debate go?](http://stats.stackexchange.com/questions/20558/where-did-the-frequentist-bayesian-debate-go)* – Peter Mortensen Aug 18 '16 at 19:47
  • @PeterMortensen I already saw that question before asking this one; however, the answer there did not address my primary source of confusion, namely what _mathematical_ difference, if any, exists between the two; remember that I am not interested in philosophical differences, since they shouldn't have any bearing on the space of possible models. – Chill2Macht Aug 18 '16 at 19:56
  • Comments are not for extended discussion; this conversation has been [moved to chat](http://chat.stackexchange.com/rooms/44174/discussion-on-question-by-william-is-there-any-mathematical-basis-for-the-baye). – whuber Aug 18 '16 at 23:42
  • The Bayesian debate is less about *probability* and much more about *statistical interpretation* and the validity of its application. – RBarryYoung Aug 19 '16 at 19:54
  • Possible duplicate of [Examples of Bayesian and frequentist approach giving different answers](http://stats.stackexchange.com/questions/43471/examples-of-bayesian-and-frequentist-approach-giving-different-answers) – user541686 Aug 21 '16 at 01:16
  • @Mehrdad This question is not about the different approaches giving different answers, it is about the possibility of formalizing, via mathematical axioms, the difference between Bayesianism and frequentism. The answers to the linked-to question do not explain the axiomatic differences between the two approaches. – Chill2Macht Aug 21 '16 at 01:20
  • "The disagreement between Bayesians and frequentists arises from a clash between two extreme positions. Bayesians assume that our prior uncertainty should _always_ be framed in terms of mathematical probabilities; frequentists assume it should play no role in our deliberations." p. 339, Weisberg, _Willful Ignorance_. My question is whether this difference, as so described, can be formulated mathematically. I.e. what axioms are appropriate for the mathematical models that encode Bayesian inference, and "" frequentist inference? – Chill2Macht Apr 20 '17 at 17:55
  • Inasmuch as statistical inference can ever be considered to give "correct" answers, it must be in the context of simplifying assumptions about the reality of the situation. In a mathematical model, such simplifying assumptions can always be stated as (mathematical) axioms. Since statistical inference uses mathematical tools, presumably its simplifying assumptions can be stated as mathematical axioms, as part of a mathematical model. My question is what those simplifying assumptions/mathematical axioms are, vis a vis the frequentist version of inference versus the Bayesian version of inference. – Chill2Macht Apr 20 '17 at 17:58

11 Answers


Stats is not Math

First, I steal @whuber's words from a comment in Stats is not maths? (applied in a different context, so I'm stealing words, not citing):

If you were to replace "statistics" by "chemistry," "economics," "engineering," or any other field that employs mathematics (such as home economics), it appears none of your argument would change.

All these fields are allowed to exist and to have questions that are not solved only by checking which theorems are correct. Though some answers at Stats is not maths? disagree, I think it is clear that statistics is not (pure) mathematics. If you want to do probability theory, a branch of (pure) mathematics, you may indeed ignore all debates of the kind you ask about. If you want to apply probability theory to modeling some real-world questions, you need something more to guide you than just the axioms and theorems of the mathematical framework. The remainder of this answer rambles about this point.

The claim "if we want to be mathematically correct, shouldn't we disallow any interpretation of probability" also seems unjustified. Putting an interpretation on top of a mathematical framework does not make the mathematics incorrect (as long as the interpretation is not claimed to be a theorem in the mathematical framework).

The debate is not (mainly) about axioms

Though there are some alternative axiomatizations*, the(?) debate is not about disputing the Kolmogorov axioms. Ignoring some subtleties with zero-measure conditioning events, leading to regular conditional probability etc., about which I don't know enough, the Kolmogorov axioms and conditional probability imply Bayes' rule, which no one disputes. However, if $X$ is not even a random variable in your model (model in the sense of the mathematical setup consisting of a probability space or a family of them, random variables, etc.), it is of course not possible to compute the conditional distribution $P(X\mid Y)$. Nor does anyone dispute that frequency properties, if correctly computed, are consequences of the model. For example, the conditional distributions $p(y\mid \theta)$ in a Bayesian model define an indexed family of probability distributions $p(y; \theta)$ by simply letting $p(y \mid \theta) = p(y; \theta)$, and if some results hold for all $\theta$ in the latter, they hold for all $\theta$ in the former, too.
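
As a small illustration of this last point (my own sketch, not part of the original answer): a standard 95% interval for a normal mean has roughly 95% coverage at every fixed $\theta$, regardless of whether $\theta$ is read as an index of a family or as a value of a random variable.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(theta, n=30, reps=2000):
    """Empirical coverage, at a fixed theta, of the usual 95% z-interval
    for the mean of N(theta, 1) based on n observations."""
    samples = rng.normal(theta, 1.0, size=(reps, n))
    means = samples.mean(axis=1)
    half_width = 1.96 / np.sqrt(n)
    return np.mean(np.abs(means - theta) < half_width)

# The frequency guarantee is "for all theta": coverage is ~0.95 at every
# fixed theta, whether theta indexes a family or was drawn from a prior.
for theta in [-3.0, 0.0, 10.0]:
    assert abs(coverage(theta) - 0.95) < 0.03
```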

The debate is about how to apply the mathematics

The debates (as much as any exist**) are instead about how to decide what kind of probability model to set up for a (real-life, non-mathematical) problem and which implications of the model are relevant for drawing (real-life) conclusions. But these questions would exist even if all statisticians agreed. To quote from the blog post you linked to [1], we want to answer questions like

How should I design a roulette so my casino makes $? Does this fertilizer increase crop yield? Does streptomycin cure pulmonary tuberculosis? Does smoking cause cancer? What movie would this user enjoy? Which baseball player should the Red Sox give a contract to? Should this patient receive chemotherapy?

The axioms of probability theory do not even contain a definition of baseball, so it is obvious that "Red Sox should give a contract to baseball player X" is not a theorem in probability theory.

Note about mathematical justifications of the Bayesian approach

There are 'mathematical justifications' for considering all unknowns as probabilistic, such as the Cox theorem that Jaynes refers to (though I hear it has mathematical problems that may or may not have been fixed; I don't know, see [2] and references therein), or the (subjective Bayesian) Savage approach (I've heard this is in [3] but haven't ever read the book), which proves that under certain assumptions a rational decision-maker will have a probability distribution over states of the world and select his action by maximizing the expected value of a utility function. However, whether or not the manager of the Red Sox should accept the assumptions, or whether we should accept the theory that smoking causes cancer, cannot be deduced from any mathematical framework, so the debate cannot be (only) about the correctness of these justifications as theorems.

Footnotes

*I have not studied it, but I've heard de Finetti has an approach where conditional probabilities are primitives rather than obtained from the (unconditional) measure by conditioning. [4] mentions a debate between (Bayesians) José Bernardo, Dennis Lindley and Bruno de Finetti in a cosy French restaurant about whether $\sigma$-additivity is needed.

**as mentioned in the blog post you link to [1], there might be no clear-cut debate with every statistician belonging to one team and despising the other team. I have heard it said that we are all pragmatists nowadays and the useless debate is over. However, in my experience these differences exist, for example, in whether someone's first approach is to model all unknowns as random variables or not, and in how interested someone is in frequency guarantees.

References

[1] Simply Statistics, a statistical blog by Rafa Irizarry, Roger Peng, and Jeff Leek, "I declare the Bayesian vs. Frequentist debate over for data scientists", 13 Oct 2014, http://simplystatistics.org/2014/10/13/as-an-applied-statistician-i-find-the-frequentists-versus-bayesians-debate-completely-inconsequential/

[2] Dupré, M. J., & Tipler, F. J. (2009). New axioms for rigorous Bayesian probability. Bayesian Analysis, 4(3), 599-606. http://projecteuclid.org/download/pdf_1/euclid.ba/1340369856

[3] Savage, L. J. (1972). The foundations of statistics. Courier Corporation.

[4] Bernardo, J.M. The Valencia Story - Some details of the origin and development of the Valencia International Meetings on Bayesian Statistics. http://www.uv.es/bernardo/ValenciaStory.pdf

Juho Kokkala
  • +1, in particular for "The axioms of probability theory do not even contain a definition of baseball". – amoeba Aug 18 '16 at 22:05
  • @William: The parameter is not *believed* to be a constant random variable - that's not a fact to be deduced or observed. The question is whether or not to represent epistemic uncertainty about the true value of the parameter using a probability distribution. (Frequentist analysis represents only the aleatory data-generating process using a probability distribution.) – Scortchi - Reinstate Monica Aug 19 '16 at 14:29
  • @William the classical Monty Hall has nothing that would reasonably be interpreted as a parameter or as data, it's a probability problem. Bayesian/frequentist approach would only come into play if you wanted to estimate, say, the parameter $q$ of the parametrized variant described here https://en.wikipedia.org/wiki/Monty_Hall_problem#Variants by watching multiple episodes of the gameshow. I, as a Bayesian, would probably put, e.g., a beta prior over $q$ and start updating. Whether this would work well in a computer simulation could depend strongly on how the computer simulation selects $q$. – Juho Kokkala Aug 19 '16 at 14:37
  • I preemptively note that I am not interested in continuing any debate over this in the comment section, since neither it nor this site is a place for debates. – Juho Kokkala Aug 19 '16 at 14:46
  • @amoeba: axioms seldom contain definitions :-); as I explain in my answer, an axiomatic system starts from axioms, then defines some ''things'' and then derives ''theorems''. –  Aug 22 '16 at 18:16
  • I completely agree "stats are not math". Wigner wrote an essay called "The Unreasonable Effectiveness of Mathematics in the Natural Sciences", which argued that, since there was no inherent connection between the abstract world of mathematics and the concrete world of physics, it was surprising (and wonderful) that mathematics worked so well in describing physics. I feel the same is true for statistics. I look forward to someone writing "The Unreasonable Effectiveness of Mathematics in Statistics". I personally find it amazing that abstract mathematics works so well in describing statistical phenomena. – meh Aug 22 '16 at 18:30
  • @Scortchi Then why would we expect that the data-generating process is truly random (i.e. non-deterministic)? Isn't it more often the case that the data-generating process is most likely deterministic, but too complex and determined by unknown information, such that assigning a probability distribution to the samples is also more of a mathematical convenience reflecting our ignorance of reality rather than something meant to be exactly true? Why is it appropriate to model samples as random quantities but not parameters, even though both (in most cases) are clearly entirely deterministic? – Chill2Macht Dec 09 '17 at 15:25
  • @Chill2Macht: Sure, statisticians don't need to worry about what 'truly random' might mean, just to come up with stochastic models of data-generating processes that are good enough for government work. (And when you think about it, in many studies data-generating processes that suit such models are actively contrived: rather directly by e.g. assigning experimental treatments or sampling from populations by drawing numbers out of a hat; or less so, e.g. in the design of measuring instruments & the writing of instructions for their use.) – Scortchi - Reinstate Monica Dec 10 '17 at 13:27
  • Now if you can also come up with a sensible stochastic model for a *parameter*-generating process, then you can carry out a Bayesian analysis while sticking with the frequentist concept of probability. Examples from Genetics often appear in textbooks: the genotype of a given individual being the parameter of interest, with known genotypes of ancestors furnishing a prior. But suppose the parameter of interest were, say, the speed of light in a vacuum - surely any postulated parameter-generating process would be pure fancy. – Scortchi - Reinstate Monica Dec 10 '17 at 18:00

The mathematical basis for the Bayesian vs frequentist debate is very simple. In Bayesian statistics the unknown parameter is treated as a random variable; in frequentist statistics it is treated as a fixed element of the parameter set. Since a random variable is a much more complicated mathematical object than a simple element of a set, the mathematical difference is quite evident.

However, it turns out that the actual results in terms of models can be surprisingly similar. Take linear regression, for example. Bayesian linear regression with an uninformative prior leads to a distribution for the regression parameter whose mean is equal to the parameter estimate of frequentist linear regression, which is the solution to a least-squares problem -- not even a problem from probability theory. Nevertheless, the mathematics used to arrive at these similar solutions is quite different, for the reason stated above.
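
A minimal numerical sketch of this coincidence (my own illustration, assuming a flat improper prior and known noise variance, which the answer doesn't specify): the posterior mean of the coefficients is exactly the least-squares solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Frequentist: beta_hat solves the least-squares problem min ||y - X b||^2.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Bayesian with a flat (improper uniform) prior on beta and known sigma:
# the posterior is N(beta_ols, sigma^2 (X'X)^{-1}), so its mean solves the
# same normal equations X'X b = X'y.
posterior_mean = np.linalg.solve(X.T @ X, X.T @ y)

assert np.allclose(beta_ols, posterior_mean)
```

The two numbers agree, yet one came from an optimization problem and the other from a conditional distribution, which is exactly the point of the paragraph above.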

Naturally, because of this difference in the mathematical treatment of the unknown parameter (random variable vs element of a set), both Bayesian and frequentist statistics hit cases where it might seem more advantageous to use the competing approach. Confidence intervals are a prime example. Not having to rely on MCMC to get a simple estimate is another. However, these are usually matters of taste, not of mathematics.

Nick Cox
mpiktas
  • Although the constant is a special case of a random variable, I would hesitate to conclude that Bayesianism is more general. You would not get frequentist results from Bayesian ones by simply collapsing the random variable to a constant. The difference is more profound. When you assume that your parameter is the unknown constant, the focal point of study becomes the estimate, which is a random variable (since it is a measurable function of the sample), and how close it is to the true value of the parameter, or in what way to get the estimate so it is close to the true value. – mpiktas Aug 19 '16 at 07:08
  • Since the estimate is a random variable, you cannot study it by ignoring measure theory, so I find your statement that many statisticians display an astonishing amount of ignorance and disdain for measure theory quite surprising. Have you read Asymptotic Statistics by A. van der Vaart? I would consider this book a very good overview of frequentist statistics and measure theory features quite prominently there. – mpiktas Aug 19 '16 at 07:14
  • Bayesian statistics on the other hand derives the distribution of the parameter almost immediately, and then the question is how to actually compute it (lots of research on various sampling algorithms, Metropolis-Hastings, etc) and what is the importance of priors. I am not that familiar with the research on Bayesian statistics, so my generalisation might be off a bit. Going to personal preferences, notwithstanding the fact that I was trained more or less as a frequentist, I do not like that Bayesian statistics uses quite a restricted subset of available distributions... – mpiktas Aug 19 '16 at 07:25
  • It always starts with normal distribution and its conjugates and how far this gets you. Since almost all data I work is not normally distributed, I am immediately suspicious and prefer to work with methods which are distribution agnostic. However this is a personal preference, and I find that in applied work I do I have not yet found a problem for which frequentist approach would fail so spectacularly that I would need to switch to Bayesian one. – mpiktas Aug 19 '16 at 07:31
  • "It always starts with normal distribution and its conjugates and how far this gets you..." - this is why one uses Monte Carlo methods to sample from the posterior parameter distribution; these work also for general distributions (BUGS software and its variants). – John Donn Aug 19 '16 at 17:03
  • @mpiktas Thank you for the clarification -- I was not aware of any of this. I am going to look into acquiring a copy of Asymptotic Statistics by A. van der Vaart -- I had not heard of it before. "You would not get frequentist results by simply collapsing the random variable to a constant. The difference is more profound." Should I ask a new question about elaborating on this? This seems like the most important point that has been raised in this discussion, but I don't know/understand why this is true. – Chill2Macht Aug 19 '16 at 19:22
  • @William It might indeed be useful to open a new question about that -- at least I did not understand from this question that that would be something you are after. On the other hand, that sounds to be so close to "What is frequentist statistics?" that it might be more useful to do some more background reading first. E.g., do you know that when considering whether an estimator is unbiased one takes into account multiple possible values of the parameter? – Juho Kokkala Aug 20 '16 at 11:20
  • @JuhoKokkala I think so yes, if I remember correctly that was covered in Casella and Berger, which I am somewhat familiar with. How does that fact explain this? – Chill2Macht Aug 21 '16 at 01:47
  • I guess I don't understand how you understand 'collapsing the random variable to a constant'. If we assumed $\theta=3$ a.s. we wouldn't care about the behavior of the estimator when $\theta=4$? (Let's not continue this discussion here) – Juho Kokkala Aug 21 '16 at 05:46
  • I don't find the definition of ''frequentist'' in here; in your comment to my answer you referred to your answer, but I can't find it? –  Aug 23 '16 at 10:29
  • @mpiktas: probably this is interesting http://stats.stackexchange.com/questions/31867/bayesian-vs-frequentist-interpretations-of-probability/31868#31868 –  Aug 23 '16 at 10:34

I don't like philosophy, but I do like math, and I want to work exclusively within the framework of Kolmogorov's axioms.

How exactly would you apply Kolmogorov's axioms alone without any interpretation? How would you interpret probability? What would you say to someone who asked you "What does your estimate of probability $0.5$ mean?" Would you say that your result is a number $0.5$, which is correct since it follows the axioms? Without any interpretation you couldn't say that this suggests how often we would expect to see the outcome if we repeated our experiment. Nor could you say that this number tells you how certain you are about the chance of an event happening. Nor could you answer that this tells you how likely you believe the event to be. How would you interpret expected value - as some numbers multiplied by some other numbers and summed together, valid since they follow the axioms and a few other theorems?

If you want to apply the mathematics to the real world, then you need to interpret it. The numbers alone without interpretations are... numbers. People do not calculate expected values to estimate expected values, but to learn something about reality.

Moreover, probability is abstract, while we apply statistics (and probability per se) to real-world happenings. Take the most basic example: a fair coin. In the frequentist interpretation, if you threw such a coin a large number of times, you would expect the same number of heads and tails. However, in a real-life experiment this would almost never happen. So $0.5$ probability has really nothing to do with any particular coin thrown any particular number of times.
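
A quick simulation (my own sketch, not part of the original answer) makes the point concrete: the relative frequency settles near $0.5$, yet an exactly even split of heads and tails is rare.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 10_000, 1_000
heads = rng.binomial(n, 0.5, size=reps)  # heads in n fair tosses, repeated

# The relative frequency is close to 0.5 in every repetition ...
assert np.all(np.abs(heads / n - 0.5) < 0.05)

# ... yet an *exactly* even split is rare:
# P(heads == n/2) = C(n, n/2) / 2^n, roughly 0.008 for n = 10,000.
exact_splits = np.mean(heads == n // 2)
assert exact_splits < 0.05
```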

Probability does not exist

-- Bruno de Finetti

Nick Cox
Tim
  • "If you threw such a coin a large number of times, you would expect the same number of heads and tails" -- that is an incorrect understanding of the law of large numbers. See chapter III of Volume 1 of Feller's _An Introduction to Probability Theory and Applications_. For example, on p.67 "In a population of normal coins the majority is necessarily maladjusted". – Chill2Macht Aug 18 '16 at 18:26
  • @William so what exactly would you answer to the question "what does p=0.5 mean?", where p is the estimated probability in a coin-tossing experiment...? – Tim Aug 18 '16 at 19:41
  • You are also quoting Feller who mentions "majority" - majority of what exactly if you are not making frequentist interpretations of probability..? – Tim Aug 18 '16 at 19:50
  • @William they are "possible" in what sense exactly? I am going to toss the coin only once more, you tell me that p = 0.5, so that does p say about my future toss? Majority in my single future toss is 1 head or 1 tail, so ok, let it be that 0.5 is correct. But what if you told me that p = 0.3? How in hell "majority" of a single toss can be 0.3?! (Without making any frequentist interpretations.) – Tim Aug 18 '16 at 20:18
  • @William but **I am** asking about some particular coin that is going to be thrown only once. So it seems that your definition does not apply to situation where I am going to throw it only once..? – Tim Aug 18 '16 at 20:29
  • @William and you **did not** make frequentist assumptions about probability in here?! You just made frequentist interpretation of probability. – Tim Aug 18 '16 at 20:42
  • Oversimplifying things: in the frequentist viewpoint probability is related to proportions of events happening among possible events; in the Bayesian interpretation it is about how much something is believable (see https://en.wikipedia.org/wiki/Probability#Interpretations). By telling me about sample space etc. you *assumed* that there is something besides the single future coin toss -- this is your **interpretation** of probability, since there is going to be only a single toss, so the whole argument about sample space does not apply to it. You are perfectly right with your interpretation, but this is – Tim Aug 18 '16 at 20:53
  • interpretation. To apply probability to real-world happenings you need to make such interpretations. What is the probability that Trump wins US election in 2016? This question is unanswerable if you won't make assumptions about what probability is. – Tim Aug 18 '16 at 20:55
  • @William This quote about "necessarily maladjusted" coins is necessarily wrong: [coins cannot be biased](http://www.stat.columbia.edu/~gelman/research/published/diceRev2.pdf). – amoeba Aug 23 '16 at 17:52
  • @amoeba No it isn't. Read the chapter. – Chill2Macht Aug 23 '16 at 18:09
  • @William: I understood "maladjusted" to mean that $p(\mathrm{Heads})\ne 1/2$. For well-tossed coin it is impossible (i.e. "loaded" coins cannot exist), see the link above. If "maladjusted" means something else, then it's a misunderstanding from my side. – amoeba Aug 23 '16 at 18:16

Probability spaces and Kolmogorov's axioms

A probability space $\mathcal{P}$ is by definition a triple $(\Omega, \mathcal{F}, \mathbb{P} )$ where $\Omega$ is a set of outcomes, $\mathcal{F}$ is a $\sigma$-algebra of subsets of $\Omega$ and $\mathbb{P}$ is a probability measure that fulfills Kolmogorov's axioms, i.e. $\mathbb{P}$ is a function from $\mathcal{F}$ to $[0,1]$ such that $\mathbb{P}(\Omega)=1$ and for disjoint $E_1, E_2, \dots$ in $\mathcal{F}$ it holds that $\mathbb{P} \left( \cup_{j=1}^\infty E_j \right)=\sum_{j=1}^\infty \mathbb{P}(E_j)$.

Within such a probability space one can, for two events $E_1, E_2$ in $\mathcal{F}$ define the conditional probability as $\mathbb{P}(E_1|_{E_2})\stackrel{def}{=}\frac{\mathbb{P}(E_1 \cap E_2)}{\mathbb{P}(E_2)}$

Note that:

  1. this ''conditional probability'' is only defined when $\mathbb{P}$ is defined on $\mathcal{F}$ and $\mathbb{P}(E_2)>0$, so we need a probability space to be able to define conditional probabilities.
  2. A probability space is defined in very general terms (a set $\Omega$, a $\sigma$-algebra $\mathcal{F}$ and a probability measure $\mathbb{P}$), the only requirement is that certain properties should be fulfilled but apart from that these three elements can be ''anything''.

More detail can be found in this link

Bayes' rule holds in any (valid) probability space

From the definition of conditional probability it also holds that $\mathbb{P}(E_2|_{E_1})=\frac{\mathbb{P}(E_2 \cap E_1)}{\mathbb{P}(E_1)}$. From these two equations we find Bayes' rule, so Bayes' rule holds (by definition of conditional probability) in any probability space: derive $\mathbb{P}(E_1 \cap E_2)$ and $\mathbb{P}(E_2 \cap E_1)$ from each equation and equate them (they are equal because intersection is commutative).
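
For completeness, spelling out the derivation just described: by the definition of conditional probability,

$$\mathbb{P}(E_1 \cap E_2) = \mathbb{P}(E_1|_{E_2})\,\mathbb{P}(E_2) \qquad \text{and} \qquad \mathbb{P}(E_2 \cap E_1) = \mathbb{P}(E_2|_{E_1})\,\mathbb{P}(E_1).$$

Since $E_1 \cap E_2 = E_2 \cap E_1$, the right-hand sides are equal, and dividing by $\mathbb{P}(E_2)>0$ gives Bayes' rule:

$$\mathbb{P}(E_1|_{E_2}) = \frac{\mathbb{P}(E_2|_{E_1})\,\mathbb{P}(E_1)}{\mathbb{P}(E_2)}.$$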

As Bayes' rule is the basis for Bayesian inference, one can do Bayesian analysis in any valid probability space (i.e. one fulfilling all conditions, among others Kolmogorov's axioms).

Frequentist definition of probability is a ''special case''

The above holds ''in general'', i.e. we have no specific $\Omega$, $\mathcal{F}$, $\mathbb{P}$ in mind as long as $\mathcal{F}$ is a $\sigma$-algebra on subsets of $\Omega$ and $\mathbb{P}$ fulfills Kolmogorov's axioms.

We will now show that a ''frequentist'' definition of $\mathbb{P}$ fulfills Kolmogorov's axioms. If that is the case, then ''frequentist'' probabilities are only a special case of Kolmogorov's general and abstract probability.

Let's take an example and roll the dice. Then the set of all possible outcomes $\Omega$ is $\Omega=\{1,2,3,4,5,6\}$. We also need a $\sigma$-algebra on this set $\Omega$ and we take $\mathcal{F}$ the set of all subsets of $\Omega$, i.e. $\mathcal{F}=2^\Omega$.

We still have to define the probability measure $\mathbb{P}$ in a frequentist way. Therefore we define $\mathbb{P}(\{1\})$ as $\mathbb{P}(\{1\}) \stackrel{def}{=} \lim_{n \to +\infty} \frac{n_1}{n}$ where $n_1$ is the number of $1$'s obtained in $n$ rolls of the die. Similarly for $\mathbb{P}(\{2\})$, ... $\mathbb{P}(\{6\})$.

In this way $\mathbb{P}$ is defined for all singletons in $\mathcal{F}$. For any other set in $\mathcal{F}$, e.g. $\{1,2\}$ we define $\mathbb{P}(\{1,2\})$ in a frequentist way i.e. $\mathbb{P}(\{1,2\}) \stackrel{def}{=} \lim_{n \to +\infty} \frac{n_1+n_2}{n}$, but by the linearity of the 'lim', this is equal to $\mathbb{P}(\{1\})+\mathbb{P}(\{2\})$, which implies that Kolmogorov's axioms hold.
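
A simulation sketch of this limiting-frequency definition (my own illustration, with a large finite $n$ standing in for the limit) and of the additivity just argued:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000                          # large finite n standing in for the limit
rolls = rng.integers(1, 7, size=n)   # fair die, outcomes 1..6

p1 = np.mean(rolls == 1)             # relative frequency n_1 / n
p2 = np.mean(rolls == 2)             # relative frequency n_2 / n
p12 = np.mean((rolls == 1) | (rolls == 2))   # relative frequency of {1, 2}

# (n_1 + n_2)/n = n_1/n + n_2/n: additivity holds by linearity, as in the text
assert abs(p12 - (p1 + p2)) < 1e-12

# and each frequency is close to the intuitive value 1/6
assert abs(p1 - 1/6) < 0.01
```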

So the frequentist definition of probability is only a special case of Kolmogorov's general and abstract definition of a probability measure.

Note that there are other ways to define a probability measure that fulfills Kolmogorov's axioms, so the frequentist definition is not the only possible one.

Conclusion

The probability in Kolmogorov's axiomatic system is ''abstract'', it has no real meaning, it only has to fulfill conditions called ''axioms''. Using only these axioms Kolmogorov was able to derive a very rich set of theorems.

The frequentist definition of probability fulfills the axioms, and therefore, replacing the abstract ''meaningless'' $\mathbb{P}$ by a probability defined in a frequentist way, all these theorems remain valid, because the ''frequentist probability'' is only a special case of Kolmogorov's abstract probability (i.e. it fulfills the axioms).

One of the properties that can be derived in Kolmogorov's general framework is Bayes' rule. As it holds in the general and abstract framework, it will also hold (cfr supra) in the specific case where the probabilities are defined in a frequentist way (because the frequentist definition fulfills the axioms, and these axioms were the only thing needed to derive all the theorems). So one can do Bayesian analysis with a frequentist definition of probability.

Defining $\mathbb{P}$ in a frequentist way is not the only possibility, there are other ways to define it such that it fulfills the abstract axioms of Kolmogorov. Bayes' rule will also hold in these ''specific cases''. So one can also do Bayesian analysis with a non-frequentist definition of probability.

EDIT 23/8/2016

@mpiktas reaction to your comment:

As I said, the sets $\Omega, \mathcal{F}$ and the probability measure $\mathbb{P}$ have no particular meaning in the axiomatic system, they are abstract.

In order to apply this theory you have to give further definitions (so what you say in your comment, ''no need to muddle it further with some bizarre definitions'', is wrong; you need additional definitions).

Let's apply it to the case of tossing a fair coin. The set $\Omega$ in Kolmogorov's theory has no particular meaning, it just has to be ''a set''. So we must specify what this set is in case of the fair coin, i.e. we must define the set $\Omega$. If we represent head as H and tail as T, then the set $\Omega$ is by definition $\Omega\stackrel{def}{=}\{H,T\}$.

We also have to define the events, i.e. the $\sigma$-algebra $\mathcal{F}$. We define it as $\mathcal{F} \stackrel{def}{=} \{\emptyset, \{H\},\{T\},\{H,T\} \}$. It is easy to verify that $\mathcal{F}$ is a $\sigma$-algebra.

Next we must define, for every event $E \in \mathcal{F}$, its measure. So we need to define a map from $\mathcal{F}$ to $[0,1]$. I will define it in the frequentist way: for a fair coin, if I toss it a huge number of times, then the fraction of heads will be 0.5, so I define $\mathbb{P}(\{H\})\stackrel{def}{=}0.5$. Similarly I define $\mathbb{P}(\{T\})\stackrel{def}{=}0.5$, $\mathbb{P}(\{H,T\})\stackrel{def}{=}1$ and $\mathbb{P}(\emptyset)\stackrel{def}{=}0$. Note that $\mathbb{P}$ is a map from $\mathcal{F}$ to $[0,1]$ and that it fulfills Kolmogorov's axioms.
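The construction above can be checked mechanically. A small sketch (finite case, so countable additivity reduces to finite additivity; the encoding of the sets is mine, not part of any library):

```python
# The coin space (Omega, F, P) defined above, encoded directly.
Omega = frozenset({"H", "T"})
F = {frozenset(), frozenset({"H"}), frozenset({"T"}), Omega}
P = {frozenset(): 0.0, frozenset({"H"}): 0.5, frozenset({"T"}): 0.5, Omega: 1.0}

# sigma-algebra checks (finite case): contains Omega, closed under
# complement and under union.
assert Omega in F
assert all(Omega - E in F for E in F)
assert all(E | G in F for E in F for G in F)

# Kolmogorov's axioms: non-negativity, P(Omega) = 1, and additivity
# on disjoint events.
assert all(P[E] >= 0 for E in F)
assert P[Omega] == 1.0
for E in F:
    for G in F:
        if not (E & G):  # disjoint events
            assert P[E | G] == P[E] + P[G]
print("axioms verified")
```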

For a reference with the frequentist definition of probability see this link (at the end of the section 'definition') and this link.

  • 1
    The problem I have with the frequentist definition is that those limits don't seem like they are always well-defined nor that they return the same value for all elements of the event space, although I guess you probably mean to take the expectation of those values (since $n_1$ and $n_2$ are random variables). In any case, this is a very thorough answer which gives a good explanation in terms/language which I easily understand -- I would upvote twice if I could. – Chill2Macht Aug 22 '16 at 16:30
  • @William: thanks ! As you say $n_1$ and $n_2$ are random, but the fractions $n_1/n$ and $n_2/n$ converge (in the frequentist view) to some fixed, non-random number, and that number (in the frequentist view) is the probability (measure) of the outcome. –  Aug 22 '16 at 17:55
  • 16
    Perhaps one should note somewhere that there is a frequentist/Bayesian debate about the interpretation of probability and there is a frequentist/Bayesian debate about statistical inference. **These are two different (albeit related) debates.** This answer talks exclusively about the first one, which is fine (and I guess what @William was interested in here, as he chose to accept this answer), but most of the other answers talk mostly about the second one. This is just a note for future readers, but also a note to William. – amoeba Aug 22 '16 at 18:47
  • @amoeba: I know, but I only answered the question William asked. –  Aug 23 '16 at 05:50
  • 3
I am voting down, because there is no reference to the definition of "frequentist probability", and without it, the post does not make sense. For example the given definition of $P(\{1\})$ is not even mathematically correct, because the definition depends on a limit over $n$ rolls of a die. Mathematical objects are abstract and do not depend on physical objects. Furthermore to prove that the limit exists you need to construct a probability space, where the random variable $n_1/n$ is defined, and then prove that it converges, for which you need the measure theory and the ... – mpiktas Aug 23 '16 at 06:16
  • 2
    definition of the probability. So even if we allow such as definition it is circular, i.e. to check whether object satisfies the definition you need to have the object defined. I would dearly want to get a reference to a textbook which uses such a definition and tries to use it to derive all the usual results in statistics. – mpiktas Aug 23 '16 at 06:22
@mpiktas: this is the definition given by Kolmogorov. The definition of probability is in the third component of the triple. You can Google 'axiomatic theory of probability'. –  Aug 23 '16 at 07:04
  • @fcop you say "Bayes' rule holds in any (valid) probability space" but **you can also use it with improper priors**, i.e. in cases that are not valid from axiomatic point of view. – Tim Aug 23 '16 at 07:22
  • @Tim: I wrote "As Bayes rule is the basis for Bayesian inference, one can do Bayesian analysis in any valid (i.e. fulfilling all conditions, a.o. Kolmogorov's axioms) probability space", that is not conflicting with what you say because as I did NOT (sorry for capitals, I don't know how to bold in comment) say that you can ONLY do it it valid probability spaces. So I agree with you, but the use of improper priors is only an ''extension'' of Bayes rule. –  Aug 23 '16 at 07:31
@fcop, Kolmogorov certainly did not give such a definition. I suspect that you refer to the classical example of a finite probability space, given in the first chapter of A. N. Shiryaev, Probability (http://link.springer.com/chapter/10.1007/978-1-4757-2539-1_2). If you look at this source, and do not rely on Google, you will see that the definition of probability follows measure theory, i.e. there is no need to check whether it satisfies Kolmogorov's axioms, because they are implicitly used in the definition. Please provide a credible source for your claims. – mpiktas Aug 23 '16 at 07:33
@mpiktas: this is what Kolmogorov wrote (translated): https://www.york.ac.uk/depts/maths/histstat/kolmogorov_foundations.pdf and obviously it is based on measure theory; that's why $\mathbb{P}$ is called a probability measure. So I think your downvote was unfair ... –  Aug 23 '16 at 07:35
  • @fcop, we refer practically to the same source. However I fail to find the definition $P(\{1\})=\lim \frac{n_1}{n}$, – mpiktas Aug 23 '16 at 07:40
@mpiktas: how would you formulate the frequentist definition that the probability of throwing '1' with a die is 1/6? I would say that, if I roll the die a large number of times, then the fraction of '1's would be 1/6; translated to maths: $\lim_{n \to +\infty} n_1/n = 1/6$. Do you see any other way ? –  Aug 23 '16 at 07:42
  • @fcop as I said, please find me the reference to this definition of probability. Since the original question is about the mathematical details, we should stick to mathematical definitions. As I've already said your given definition does not make sense. – mpiktas Aug 23 '16 at 07:48
  • @mpiktas: see on top of page numbered 22 of this PDF: http://www.nyu.edu/econ/dept/courses/peracchi/ustat3.pdf, probably it is good that you give your definition of frequentist probability now ? –  Aug 23 '16 at 07:57
  • @mpiktas: and if $\mathbb{P}(\{1\}) \stackrel{def}{=} \lim_{n \to +\infty} \frac{n_1}{n}$ is not a mathematical definition, how would you call it ? –  Aug 23 '16 at 08:00
@fcop, this is one example of a given definition of a **finite** elementary probability space. It is not a definition. Furthermore it is an example of the distribution of a Bernoulli variable. I fail to understand why this example then gets the name of a "frequentist" probability, and why it is needed to show that Kolmogorov's axioms hold. They hold because the example was based on these axioms; I think both of your sources make that clear. – mpiktas Aug 23 '16 at 08:06
  • 1
    @fcop, mathematical definition is more than a mere valid collection of mathematical signs. When you define something it needs to exist. In your case you need an already working definition of the left hand side to prove the existence of the limit on the right hand side. – mpiktas Aug 23 '16 at 08:09
@mpiktas: It is time to give your definition of frequentist probability now ... –  Aug 23 '16 at 08:10
@mpiktas: take a coin, toss it one billion times and compute the fraction of heads, then write in a comment what you got. One billion tosses is a large number of occurrences; that's where 'frequentist' comes from. –  Aug 23 '16 at 08:14
  • 1
    @fcop, I've already given my answer to the OP question. There is one measure theoretic definition of probability, and there is no such thing as "frequentist" or "bayesian" probability definitions. There is Bayesian and frequentist statistics which difference I explained in my answer, but they use the same definition of probability. The debate between Bayesian and frequentist is muddled already and frankly in my opinion it is simply silly, so there is no need to muddle it further with some bizarre definitions. – mpiktas Aug 23 '16 at 08:25
  • @mpiktas: I will add my reaction in an EDIT section at the bottom of my answer –  Aug 23 '16 at 09:39
@mpiktas: probably this is interesting http://stats.stackexchange.com/questions/31867/bayesian-vs-frequentist-interpretations-of-probability/31868#31868 –  Aug 23 '16 at 10:34
  • @mpiktas: you argued that "I am voting down, because there is no reference to the definition of "frequentist probability" definition", you can undo that because I added a link at the bottom of my answer. –  Aug 23 '16 at 15:06
  • 7
    This long and detailed article in Stanford Encyclopedia of Philosophy [on Probability Interpretations](http://plato.stanford.edu/entries/probability-interpret/#FreInt) contains a long and detailed section on frequentism and might be a better reference than your link to Wikipedia (Stanford Encyclopedia is quite authoritative, unlike Wikipedia). It makes it clear that whether frequentist definition makes sense at all and even what exactly constitutes the frequentist definition is a matter of 150-years-long ongoing debate that you and @mpiktas seem to be re-enacting here in the comments section. – amoeba Aug 23 '16 at 15:20
@amoeba: thanks for the link, I will read it, but the discussion in the comments is different from the 150-year-long debate. In these comments I argue that the frequentist definition of probability is the relative frequency of occurrence when the experiment is repeated a large number of times. So it is simply about what exactly frequentist probability is, and I think that there is not so much debate about that. If $n$ is the number of times the experiment is repeated and $n_1$ the number of occurrences of the event, then it is $\lim_{n\to \infty} n_1/n$; what do you think ? –  Aug 23 '16 at 15:29
  • The debate, in my opinion, has always been about whether this definition is meaningful (and whether it constitutes a "definition" at all). But this is not a place to continue this discussion, so I will stop now. – amoeba Aug 23 '16 at 15:58
  • @amoeba: well the discussion here is not about meaningfulness, it is just whether this IS (sorry for the capitals, I don't know how to bold in comments) the definition and there is less doubt about I think. –  Aug 23 '16 at 16:43
  • 2
    @amoeba: I particularly like the reminder in your link that we could interpret "probability" in all sorts of ways having nothing to do with the concept as usually understood - e.g. normalized length - & still remain consistent with Kolmogorov's axioms. – Scortchi - Reinstate Monica Aug 25 '16 at 15:22
@mpiktas: another definition of 'frequentist probability': http://stats.stackexchange.com/questions/225353/probability-of-a-single-real-life-future-event-what-does-it-mean-when-they-say/225377#225377 –  Aug 28 '16 at 18:14
  • @mpiktas: see the answer of Aksakal on this question, Kolmogorov himself seems to have given the ''frequentist'' definition !! http://stats.stackexchange.com/questions/232356/who-are-frequentists –  Aug 29 '16 at 21:06
@amoeba: see the answer of Aksakal on this question, Kolmogorov himself seems to have given the ''frequentist'' definition !! http://stats.stackexchange.com/questions/232356/who-are-frequentists –  Aug 29 '16 at 21:06
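As a side note to the comment thread above: the limiting relative frequency under debate can be illustrated (though certainly not proved) with a short simulation, here a fair die and the event "roll a 1". This is only a sketch of the law-of-large-numbers intuition, not a resolution of the definitional question.

```python
import random

random.seed(0)  # fix the seed so this sketch is reproducible

# Roll a fair die repeatedly and record the running fraction n1/n of 1s
# at a few checkpoints; the fraction drifts toward 1/6 ~ 0.1667.
n1 = 0
checkpoints = {}
for n in range(1, 100_001):
    if random.randint(1, 6) == 1:
        n1 += 1
    if n in (100, 10_000, 100_000):
        checkpoints[n] = n1 / n

for n, frac in sorted(checkpoints.items()):
    print(n, round(frac, 4))  # no exact values claimed: the path is random
```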
10

My view of the contrast between Bayesian and frequentist inference is that the first issue is the choice of the event for which you want a probability. Frequentists assume what you are trying to prove (e.g., a null hypothesis) and then compute the probability of observing something like what you already observed, under that assumption. There is an exact analogy between such reverse information-flow probabilities and sensitivity and specificity in medical diagnosis, which have caused enormous misunderstandings and need to be bailed out by Bayes' rule to get forward probabilities ("post-test probabilities"). Bayesians compute the probability of an event, and absolute probabilities are impossible to compute without an anchor (the prior). The Bayesian probability of the veracity of a statement is very different from the frequentist probability of observing data under a certain unknowable assumption. The differences are more pronounced when the frequentist must adjust for other analyses that have been done or could have been done (multiplicity, sequential testing, etc.).

So the discussion of the mathematical basis is very interesting and is a very appropriate discussion to have. But one has to make a fundamental choice of forwards vs. backwards probabilities. Hence what is conditioned upon, which isn't exactly math, is incredibly important. Bayesians believe that full conditioning on what you already know is key. Frequentists more often condition on what makes the mathematics simple.
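The "bailout by Bayes' rule" mentioned above is a one-liner. A sketch with made-up numbers (the prevalence, sensitivity, and specificity here are purely illustrative, not from any study):

```python
# Post-test ("forward") probability from sensitivity, specificity, and a
# prior prevalence, via Bayes' rule.
def post_test_prob(prevalence, sensitivity, specificity):
    p_pos_given_disease = sensitivity            # P(+ | disease)
    p_pos_given_healthy = 1 - specificity        # P(+ | no disease)
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_healthy * (1 - prevalence))
    return p_pos_given_disease * prevalence / p_pos

# A test that is 90% sensitive and 95% specific, applied where prevalence
# is 1%: a positive result still leaves only a modest post-test
# probability, because the prior (the anchor) matters.
p = post_test_prob(prevalence=0.01, sensitivity=0.90, specificity=0.95)
print(round(p, 3))  # 0.154
```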

Frank Harrell
9

I will break this up into two separate questions and answer each.

1.) Given the different philosophical views of what probability means in a Frequentist and Bayesian perspective, are there mathematical rules of probability that apply to one interpretation and do not apply to another?

No. The rules of probability remain exactly the same between the two groups.

2.) Do Bayesians and Frequentists use the same mathematical models to analyze data?

Generally speaking, no. This is because the two different interpretations suggest that a researcher can gain insight from different sources. In particular, the Frequentist framework is often thought to suggest that one can make inference on the parameters of interest only from the data observed, while a Bayesian perspective suggests that one should also include independent expert knowledge about the subject. Different data sources mean different mathematical models will be used for analysis.

It is also worth noting that there are plenty of divides between the models used by the two camps that are more related to what has been done than to what can be done (i.e. many models traditionally used by one camp can be justified by the other camp). For example, BUGS models (Bayesian inference Using Gibbs Sampling, a name that no longer accurately describes the set of models for many reasons) are traditionally analyzed with Bayesian methods, mostly due to the availability of great software packages for doing so (JAGS and Stan, for example). However, there is nothing that says these models must be strictly Bayesian. In fact, I worked on the project NIMBLE that builds these models in the BUGS framework, but allows the user much more freedom in how to make inference on them. While the vast majority of the tools we provided were customizable Bayesian MCMC methods, one could also use maximum likelihood estimation, a traditionally Frequentist method, for these models as well. Similarly, priors are often thought of as what you can do with Bayesian models that you cannot do with Frequentist models. However, penalized estimation can produce the same models by regularizing parameter estimates (although the Bayesian framework provides an easier way of justifying and choosing regularization parameters, while Frequentists are left with, in the best-case scenario of lots of data, "we chose these regularization parameters because, over a large number of cross-validated samples, they lowered the estimated out-of-sample error" ... for better or for worse).
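The prior-versus-penalty correspondence described here can be made concrete in the simplest possible case: a normal mean with known variance. This is a one-parameter sketch with made-up data; for ridge regression proper the same algebra applies to each coefficient.

```python
# MAP estimation of a normal mean mu under a N(0, tau^2) prior coincides
# with ridge-penalized least squares when lambda = sigma^2 / tau^2.
data = [2.1, 1.9, 2.4, 2.0]   # made-up observations
sigma2 = 1.0                  # assumed known sampling variance
tau2 = 4.0                    # prior variance (the Bayesian ingredient)
lam = sigma2 / tau2           # equivalent penalty (the Frequentist ingredient)

n = len(data)
s = sum(data)

# Posterior mode under the Gaussian prior:
map_estimate = s / (n + sigma2 / tau2)
# argmin over mu of: sum((x - mu)^2) + lam * mu^2
ridge_estimate = s / (n + lam)

assert map_estimate == ridge_estimate  # same estimator, two justifications
print(round(map_estimate, 4))  # 1.9765
```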

Cliff AB
  • 1
    I object, somewhat, to this quote: "In particular, the Frequentist framework is often thought to suggest that one can make inference on the parameters of interest only from the data observed, while a Bayesian perspective suggests that one should also include independent expert knowledge about the subject". Primarily for the implication that frequentists are, for whatever reason, uninterested in independent expert knowledge about the subject. The difference between frequentists and Bayesians isn't that the former stubbornly refuse to use prior knowledge or context ... (1/2) – Ryan Simmons Aug 22 '16 at 19:28
  • 1
    ... but rather that the two schools of thought utilize that prior knowledge/context in different ways. You may argue that the Bayesian perspective takes a more principled approach towards incorporating this prior knowledge directly into a model (though, I would argue the widespread usage of non-informative priors rather dilutes this argument). But I don't think it is fair to characterize it as being an issue of frequentists NOT using that information. (2/2) – Ryan Simmons Aug 22 '16 at 19:30
  • 1
    @RyanSimmons: right, this is why I stated "is often thought to suggest...". For example, if a researcher observes that regularizing parameter estimates around an expert's opinion tends to lead itself to better predictions in the long run, there is no problem in incorporating this in a Frequentist framework ("based on Frequentist measures, this augmented estimator has better long run operating characteristics than the data-only estimator"). But this is not as straightforward as in the Bayesian framework. – Cliff AB Aug 22 '16 at 19:40
  • 1
    Fair enough! I concur. – Ryan Simmons Aug 22 '16 at 19:44
5

Bayesians and Frequentists think probabilities represent different things. Frequentists think they're related to frequencies and only make sense in contexts where frequencies are possible. Bayesians view them as ways to represent uncertainty. Since any fact can be uncertain, you can talk about the probability of anything.

The mathematical consequence is that Frequentists think the basic equations of probability only sometimes apply, and Bayesians think they always apply. So they view the same equations as correct, but differ on how general they are.

This has the following practical consequences:

(1) Bayesians will derive their methods from the basic equations of probability theory (of which Bayes Theorem is just one example), while Frequentists invent one intuitive ad-hoc approach after another to solve each problem.

(2) There are theorems indicating that if you reason from incomplete information you had better use the basic equations of probability theory consistently, or you'll be in trouble. Lots of people have doubts about how meaningful such theorems are, yet this is what we see in practice.

For example, it's possible for real-world, innocent-looking 95% confidence intervals to consist entirely of values which are provably impossible (from the same information used to derive the confidence interval). In other words, Frequentist methods can contradict simple deductive logic. Bayesian methods derived entirely from the basic equations of probability theory don't have this problem.
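The phenomenon can be reproduced in a few lines. This sketch uses the truncated-exponential setup from Jaynes's paper cited in the comments below (density $e^{\theta - x}$ for $x > \theta$, data 12, 14, 16), but with a simple normal approximation to the sampling distribution of the unbiased estimator $\bar{x} - 1$; Jaynes works with the exact distribution, and the conclusion is the same.

```python
import math

# Sampling density exp(theta - x) for x > theta, so deductive logic says
# theta must lie below every observation.
data = [12.0, 14.0, 16.0]
n = len(data)

theta_hat = sum(data) / n - 1   # unbiased: E[x] = theta + 1
se = 1 / math.sqrt(n)           # Var(x) = 1 for this density
z90 = 1.645                     # two-sided 90% normal quantile

lower = theta_hat - z90 * se
upper = theta_hat + z90 * se
print(round(lower, 3), round(upper, 3))  # 12.05 13.95

# Yet theta < min(data) = 12, so the whole interval consists of
# impossible values:
assert lower > min(data)
```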

(3) Bayesian is strictly more general than Frequentist. Since there can be uncertainty about any fact, any fact can be assigned a probability. In particular, if the facts you're working on are related to real-world frequencies (either as something you're predicting or as part of the data), then Bayesian methods can consider and use them just as they would any other real-world fact.

Consequently, any problem Frequentists feel their methods apply to, Bayesians can also handle naturally. The reverse, however, is often not true, unless Frequentists invent subterfuges to interpret their probability as a "frequency": for example, imagining multiple universes, or inventing hypothetical repetitions out to infinity which are never performed and often couldn't be even in principle.

Laplace
  • 7
    Could you provide some references to the bold statements you have provided? For example " Frequentists think the basic equations of probability only sometimes apply"? And what are the basic equations of probability? – mpiktas Aug 18 '16 at 06:01
  • 6
    Much more interesting than the B vs F debate is your remark about Confidence intervals containing impossible values. Can you give or link to specific example of a 95% CI containing only impossible values? This could be one of those things every statistician should have seen at least once in their lives (as a cautionary tale), but I haven't. – Vincent Aug 18 '16 at 08:17
  • 9
    That a CI might contain all "impossible" values does not "contradict simple deductive logic" at all. This sounds like a misunderstanding of the definition of a CI--or perhaps a confusion between the interpretations of CIs and credible intervals. – whuber Aug 18 '16 at 14:05
  • 3
    Vincent - here is the reference http://bayes.wustl.edu/etj/articles/confidence.pdf look at Example 5 on page 196 (slide 22). This was apprently known very early on. 50+ or maybe even 70 years ago. – Laplace Aug 18 '16 at 16:21
  • 2
whuber - you don't need to put "impossible" in scare quotes. If you're estimating a mass and you get a CI of (-2,-1) then those values are simply impossible, without quotes. Every reasonable person would interpret a 95% CI all of whose values are provably impossible from the same information used to derive the CI as an inconsistency between Frequentism and deductive logic. Moreover they would interpret the fact that this can't happen with Bayesian methods derived entirely from the basic equations of probability theory as an obvious advantage. All the word games in the world can't change that. – Laplace Aug 18 '16 at 16:26
  • 1
mpiktas - the basic equations of probability theory are the sum and product rules. They can be stated at various levels of generality, but you could take them to be P(A)+P(B) = P(A or B)+P(A and B) together with P(A and B) = P(A|B)P(B), with allowance for the fact that other forms might be relevant in different situations. – Laplace Aug 18 '16 at 16:30
  • 1
mpiktas - It's not a bold claim; rather it's so commonplace I'm honestly shocked anyone would try to deny it. Frequentists have said millions of times things like "the probability of Hillary winning in 2016 is meaningless because it isn't a random variable" or "the probability for the speed of light is meaningless because it's a fixed parameter and not a random variable". Bayes' Theorem is a trivial theorem of probability theory. Yet Frequentists refuse to use it in almost all real situations because they deny it applies or is meaningful. – Laplace Aug 18 '16 at 17:40
  • 1
    William, I personally haven't notice a difference between Frequentists and Bayesians on the Monty Hall problem in practice (others may have a different experience). Sometimes though, the belief that there is one true probability distribution definitely can clash with situations where probabilities should changed as the information they're conditional on changes. This is especially true if the new information isn't 'causally' related but it still highly relevant logically. – Laplace Aug 18 '16 at 19:23
  • 7
    This seems like a philosophical rant rather than an answer to the OP's question (which was strictly **not** about philosophy). – Cliff AB Aug 18 '16 at 21:10
  • 1
    Cliff AB - the question was "I don't understand what the mathematical basis is for the Bayesian versus frequentist debate" I explained the issue is subtle because their philosophical differences cause them to adopt the same equations, but with different ranges of applicability. I explained the mathematical consequences of that difference in both general and practical terms. See for example the reference above to a real CI that is guaranteed to be wrong using the same info used to construct the CI and the corresponding Bayesian Credibility Interval which trivially avoids the same mistake. – Laplace Aug 18 '16 at 22:06
  • 2
    The only time you mention differences of mathematics is "The mathematical consequence is that Frequentists think the basic equations of probability only sometimes apply, and Bayesians think they always apply." which is an unsupported statement (at least in all that you have posted). If you are implying your statement about a CI having invalid values means that Frequentists don't respect probability, then you are merely misinterpreting a confidence interval. – Cliff AB Aug 18 '16 at 22:27
  • 1
    Cliff AB - I do claim Frequentists only sometimes think the equations of probability apply. Bayes Theorem is a trivial consequence of the equations of probability. Frequentists claim most applications of Bayes Theorem made by Bayesian are illegitimate. They only allow Bayes Theorem when the prior can be interpreted as the frequency of a random variable. As evidence I cite literally every criticism of Bayesians written by any Frequentist since the dawn of time. – Laplace Aug 19 '16 at 02:03
  • 1
    Cliff AB - I'll rephrase the part about CIs. It's possible for the inference that every statistician would make from a CI (without which CI's have no practical purpose or contact with the real world) to contradict what can be deduced from the same evidence. The mathematical reason this doesn't happen with Bayesian methods ultimately traces back to the fact that Bayesians don't restrict probabilities to just 'random variables' and will thus use the equations of probability theory in situations Frequentists don't. – Laplace Aug 19 '16 at 02:07
  • 5
    "It's possible for the inference that every statistician would make from a CI (without which CI's have no practical purpose or contact with the real world) to contradict what can be deduced from the same evidence". This **still** in no way backs your claim that Frequentists ignore the rules of probability. And I'm afraid this is going down the well trodden path of "Bayes vs Frequentists: fight!" that most readers here would prefer to avoid. – Cliff AB Aug 19 '16 at 04:21
  • 1
    I never said "ignore the rules of probability", what I said was they don't always follow them. One consequence of the "rules of probability" is Bayes Theorem and it's a defining characteristic of Frequentists that they claim this shouldn't be followed in most instances. – Laplace Aug 19 '16 at 04:27
  • 2
If you do not work with conditional probabilities there is no need to apply Bayes' theorem. This does not mean that in this case the statement is made that Bayes' theorem shouldn't be followed. Your so-called rules of probability are actually simple statements of measure theory, and you will not find any paper in statistics which can ignore measure theory. – mpiktas Aug 22 '16 at 05:54
3

Question: Then if we want to be mathematically correct, shouldn't we disallow any interpretation of probability? I.e., are both Bayesian and frequentism mathematically incorrect?

Yes, and this is exactly what people do both in Philosophy of Science and in Mathematics.

  1. Philosophical approach. Wikipedia provides a compendium of interpretations/definitions of probability.

  2. Mathematical approach. Mathematicians are not safe either. In the past, the Kolmogorovian school had a monopoly on probability: a probability is defined as a finite measure that assigns 1 to the whole space ... This hegemony is no longer in place, since there are new trends in defining probability, such as Quantum probability and Free probability.

Silverfish
Tim Allen
  • Do you understand what is meant by relaxing assumptions of commutativity of random variables? (with regards to free probability -- I don't know enough QM to understand the ideas behind quantum probability) Does this mean that $X + Y \not= Y+X$ or $XY \not= YX$? I guess the discussion of von Neumann algebras and $C^*$ algebras imply the latter. – Chill2Macht Aug 19 '16 at 19:32
  • 7
    @William $C^{*}$ algebras do not correctly model most of what statistics is applied to. (By analogy, the invention of complex numbers in no way affected any application of the natural numbers to phenomena. No possible extension of the mathematical concept of probability would ever change how probability--as currently understood--is applied, either.) Tim, this answer is puzzling: the only purely *mathematical* issue concerning any application of probability is whether its axioms are consistent, and that is easily proven with simple models. – whuber Aug 19 '16 at 20:22
3

The Bayesian/frequentist debate is based on numerous grounds. If you are talking about a mathematical basis, I don't think there is much of one.

They both need to apply various approximate methods for complex problems. Two examples are "bootstrap" for frequentist and "mcmc" for bayesian.

They both come with rituals/procedures for how to use them. A frequentist example is "propose an estimator of something and evaluate its properties under repeated sampling", while a Bayesian example is "calculate probability distributions for what you don't know, conditional on what you do know". There is no mathematical basis for using probabilities in these ways.
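To make the two rituals concrete, here is a toy comparison on the same data, 7 successes in 20 trials (illustrative only: the percentile bootstrap and the Beta-posterior summary are both Monte Carlo approximations, using a uniform Beta(1, 1) prior on the Bayesian side):

```python
import random

random.seed(1)  # reproducibility for this sketch

successes, trials = 7, 20
data = [1] * successes + [0] * (trials - successes)

# Frequentist ritual: resample the data and study the estimator under
# (simulated) repeated sampling -- a 95% percentile bootstrap interval.
boot = sorted(
    sum(random.choice(data) for _ in range(trials)) / trials
    for _ in range(10_000)
)
boot_interval = (boot[250], boot[9750])

# Bayesian ritual: condition on the observed data -- the Beta(1,1) prior
# gives a Beta(1+7, 1+13) posterior, summarized by a 95% central interval.
post = sorted(random.betavariate(1 + successes, 1 + trials - successes)
              for _ in range(10_000))
post_interval = (post[250], post[9750])

print("bootstrap:", boot_interval)
print("posterior:", post_interval)
```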

The debate is more about application, interpretation, and ability to solve real world problems.

In fact, this is often used by people debating "their side", where they will use a specific "ritual/procedure" of the "other side" to argue that the whole theory should be thrown away in favor of theirs. Some examples include...

  • using stupid priors (and not checking them)
  • using stupid CIs (and not checking them)
  • confusing a computational technique with the theory (bayes is not mcmc!! Same goes for equating cross validation with machine learning)
  • talking about a problem with a specific application with one theory and not how the other theory would solve the specific problem "better"
probabilityislogic
  • Haha yes this is very true I think. I had to listen to a professor go on for a half hour about how Bayesianism is terrible because coming up with priors subjectively doesn't make sense and the whole time I was thinking "well, duh, so that's why you wouldn't choose a prior that way". My point being, I agree that strawman arguments abound. – Chill2Macht Dec 09 '17 at 15:17
1

So then wouldn't it follow that the only mathematically correct version of statistics is that which refuses to be anything but entirely agnostic with respect to Bayesianism and frequentism? If methods with both classifications are mathematically correct, then isn't it improper practice to prefer some over the others, because that would be prioritizing vague, ill-defined philosophy over precise, well-defined mathematics?

No. It does not follow. Individuals who are unable to feel their emotions are biologically incapable of making decisions, including decisions that appear to have only one objective solution. The reason is that rational decision making depends upon our emotional capacity and our preferences both cognitive and emotional. While that is scary, it is the empirical reality.

Gupta R, Koscik TR, Bechara A, Tranel D. The amygdala and decision making. Neuropsychologia. 2011;49(4):760-766. doi:10.1016/j.neuropsychologia.2010.09.029.

A person who prefers apples to oranges cannot defend this as it is a preference. Conversely, a person who prefers oranges to apples cannot defend this rationally as it is a preference. People who prefer apples will often eat oranges because the cost of apples is too great compared to the cost of oranges.

Much of the Bayesian and Frequentist debate, as well as the Likelihoodist and Frequentist debate, was centered around mistakes of understanding. Nonetheless, if we imagine that we have a person who is well trained in all methods, including minor or no longer used methods such as Carnapian probability or fiducial statistics, then it is only rational for them to prefer some tools over other tools.

Rationality only depends upon preferences; the behavior depends upon preferences and costs.

It may be the case, from a purely mathematical perspective, that one tool is better than another, where "better" is defined using some cost or utility function; but unless there is a unique answer where only one tool could work, both the costs and the preferences are to be weighed.

Consider the problem of a bookie considering offering a complex bet. Clearly, the bookie should use Bayesian methods in this case, as they are coherent and have other nice properties; but also imagine that the bookie has only a calculator, not even pencil and paper. It may be the case that the bookie, with the use of his calculator and by keeping track of things in his head, can calculate the Frequentist solution but has no chance on Earth to calculate the Bayesian one. If he is willing to take the risk of being "Dutch Booked," and also finds the potential cost small enough, then it is rational for him to offer bets using Frequentist methods.

It is rational for you to be agnostic because your emotional preferences find that to be better for you. It is not rational for the field to be agnostic unless you believe that all people share your emotional and cognitive preferences, which we know is not the case.

You write: "In short, I don't understand what the mathematical basis is for the Bayesian versus frequentist debate, and if there is no mathematical basis for the debate (which is what Wikipedia claims), I don't understand why it is tolerated at all in academic discourse."

The purpose of academic debate is to bring light to both old and new ideas. Much of the Bayesian versus Frequentist debate and the Likelihoodist versus Frequentist debate came from misunderstandings and sloppiness of thought. Some came from failing to call out preferences for what they are. A discussion of the virtues of an estimator being unbiased and noisy versus an estimator being biased and accurate is a discussion of emotional preferences, but until someone has that discussion, it is quite likely that the thinking on it will remain muddy throughout the field.

You write: "I don't like philosophy, but I do like math, and I want to work exclusively within the framework of Kolmogorov's axioms."

Why? Because you prefer Kolmogorov's axioms to Cox's, de Finetti's, or Savage's? Is that preference sneaking in? Also, probability and statistics are not math; they use math. Statistics is a branch of rhetoric. To understand why this may matter, consider your statement:

if a method is mathematically correct, then it is valid to use the method when the assumptions of the underlying mathematics hold, otherwise, if it is not mathematically correct or if the assumptions do not hold, then it is invalid to use it.

This is not true. There is a nice article on confidence intervals and their abuse; its citation is:

Morey, R., Hoekstra, R., Rouder, J., Lee, M., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.

Each of the potential confidence intervals presented in the article is mathematically valid, but if you evaluate their properties, they differ very substantially. Indeed, some of the intervals provided could be thought of as having "bad" properties, though they meet all of the assumptions of the problem. If you drop the Bayesian interval from the list and focus only on the four Frequentist intervals, a deeper analysis of when each interval is wide, narrow, or constant will show that the intervals are not "equal," though each meets the assumptions and requirements.

It is not enough for a method to be mathematically valid for it to be useful or, alternatively, as useful as possible. Likewise, it could be mathematically true but harmful. The article exhibits an interval that is at its most narrow precisely when there is the least information about the true location, and widest when perfect or near-perfect knowledge of the parameter's location exists. Regardless, it meets the coverage requirements and satisfies the assumptions.
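A small simulation makes this phenomenon concrete. The uniform model below is my own illustration in the spirit of the article's example, not taken from it: two observations are drawn from Uniform(theta − 0.5, theta + 0.5), and the interval [min(y), max(y)] is a valid 50% confidence procedure whose width behaves perversely.

```python
import random

random.seed(0)
theta = 3.7        # true parameter (arbitrary value for the demo)
n_sim = 100_000
covered = 0

for _ in range(n_sim):
    y1 = random.uniform(theta - 0.5, theta + 0.5)
    y2 = random.uniform(theta - 0.5, theta + 0.5)
    lo, hi = min(y1, y2), max(y1, y2)
    # The interval [min(y), max(y)] contains theta exactly when the two
    # observations straddle it, which happens with probability 1/2, so
    # this is a perfectly valid 50% confidence procedure.
    covered += (lo <= theta <= hi)

print(round(covered / n_sim, 2))  # close to 0.5
# Yet its width |y1 - y2| is largest precisely when the data are most
# informative: |y1 - y2| near 1 pins theta down to the midpoint almost
# exactly, while |y1 - y2| near 0 says almost nothing about theta.
```

The procedure meets its coverage requirement exactly, yet it is widest when the data nearly determine theta and narrowest when they say the least.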

Math can never be enough.

Dave Harris
  • I really like the second article. (The conclusion of the first article was something I had already heard argued in a way that convinced me, so it seemed unnecessary for me to read.) I mostly agree with what you say. To be fair, when I say math, I had more in mind the meaning "applied math" as well as the implicit understanding that the subjects and directions of mathematical research, as well as choices of mathematical axioms, are meant to model observations of the real world. Also, I don't think the second article contradicts what I am saying -- the authors take the common fallacies, phrase them mathematically (i.e. precisely, rigorously), and then provide counterexamples showing that they are false. What I was trying to say (if I remember correctly about my intentions many months ago), was that if your "philosophy" or "philosophical idea" or whatever can't be phrased/narrowed down to a precise statement, i.e. stated unambiguously, then it is useless to throw around. E.g. frequentists who draw a distinction between MLE (MAP with a flat prior) and other types of objective priors for vague reasons -- if your objection can't be stated in the form of a mathematical axiom, then there is no good reason to state your objection in the first place, because your objection is too vague to be falsifiable. Just because statistics is "using math" doesn't mean, in my opinion, that statisticians are justified to be sloppier thinkers than mathematicians. Mathematicians argue all of the time about which mathematical axioms are "worthwhile" or "interesting" to consider, as you point out, based ultimately only on emotional preferences. But these arguments are actually capable of having substance and moving fields forward, because the positions of each side are clearly and unambiguously stated -- e.g. one can say with clarity that intuitionists reject using the Law of the Excluded Middle, while other mathematicians are content to use it. Note also the fierce debate about the Axiom of Choice. But both the Law of the Excluded Middle and the Axiom of Choice are _precise_ statements which, given other _precise_ assumptions, can be falsified, shown to be falsifiable, proved, etc. (depending on the other assumptions). I.e. what I was trying to argue is that "philosophy"/"emotion" should only come into play to state preferences for different _unambiguous/precise_ _axioms_. As compared to someone saying "priors are bad" and not giving a mathematical axiom which they believe inference should satisfy, and which choosing a prior could be shown logically to violate. The former is useless, while the latter is constructive, because it gives opponents something concrete to work with, e.g. the opportunity to propose an alternative axiom which to them "seems more reasonable to assume for this problem". This is why I really like the second article you linked to, because it does exactly that -- it _"mathematizes"_ false interpretations of CI's, and _proves_ that they are false. – Chill2Macht Dec 10 '17

The following is taken from my manuscript on confidence distributions: Johnson, Geoffrey S. "Decision Making in Drug Development via Confidence Distributions." arXiv preprint arXiv:2005.04721 (2020).

In the Bayesian framework the population-level parameter of interest is considered an unrealized or unobservable realization of a random variable that depends on the observed data. This premise has cause and effect reversed. In order to overcome this the Bayesian approach reinterprets probability as measuring the subjective belief of the experimenter.

Another interpretation is that the unknown fixed parameter, say theta, was randomly selected from a known collection or prevalence of theta's (prior distribution) and the observed data is used to subset this collection, forming the posterior. The unknown fixed true theta is now imagined to have instead been randomly selected from the posterior. Every time the prior or posterior is updated, the sampling frame from which we obtained our unknown fixed true theta under investigation must be changed.

A third interpretation is that all values of theta are true simultaneously: the truth exists in a superposition depending on the evidence observed (think Schrödinger's cat). Ascribing any of these interpretations to the posterior allows one to make philosophical probability statements about hypotheses given the data.

While the p-value is typically not interpreted in the same manner, it does show us the plausibility of a hypothesis given the data - the ex-post sampling probability of the observed result or something more extreme if the hypothesis is true. This does not reverse cause and effect.
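That reading of the p-value - the sampling probability, under the hypothesis, of the observed result or something more extreme - can be made concrete with a toy binomial example (the numbers here are my own illustration):

```python
from math import comb

# Observed: 9 heads in 12 flips; hypothesis: a fair coin (p = 0.5).
n, k = 12, 9
# One-sided p-value: probability, computed under the hypothesis, of the
# observed count or anything more extreme (9, 10, 11, or 12 heads).
p_value = sum(comb(n, j) for j in range(k, n + 1)) / 2**n
print(round(p_value, 4))  # 0.073
```

The probability statement is about the experiment given the hypothesis, not about the hypothesis given the data, which is the distinction the paragraph above is drawing.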

To the Bayesian, probability is axiomatic and measures the experimenter. To the frequentist, probability measures the experiment and must be verifiable. The Bayesian interpretation of probability as a measure of belief is unfalsifiable. Only if there exists a real-life mechanism by which we can sample values of theta can a probability distribution for theta be verified. In such settings probability statements about theta would have a purely frequentist interpretation (see the second interpretation of the posterior above). This may be a reason why frequentist inference is ubiquitous in the scientific literature.

The interpretation of frequentist inference is straightforward for non-statisticians by citing confidence levels, e.g. 'We are 15.9% confident that theta is less than or equal to theta_0.' Of course, to fully appreciate this statement of confidence one must more fully define the p-value as a frequency probability of the experiment if the hypothesis is true (think of the proof-by-contradiction structure a prosecutor uses in a courtroom setting: innocent until proven guilty). A Bayesian approach may make it easy for some to interpret a posterior probability, e.g. 'There is 17.4% Bayesian belief probability that theta is less than or equal to theta_0.' Of course, to fully appreciate this statement one must fully define Bayesian belief and make it clear this is not a verifiable statement about the actual parameter, the hypothesis, nor the experiment. If the prior distribution is chosen in such a way that the posterior is dominated by the likelihood or is proportional to the likelihood, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment. In short, for those who subscribe to the frequentist interpretation of probability, the confidence distribution summarizes all probability statements about the experiment one can make. It is a matter of correct interpretation given the definition of probability and what constitutes a random variable. The posterior remains an incredibly useful tool and can be interpreted as a valid asymptotic confidence distribution. The frequentist framework can easily incorporate historical data through a fixed-effect meta-analysis.
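The agreement between a likelihood-dominated posterior and a confidence distribution can be checked in the simplest possible case, a normal mean with known sigma under a flat prior. The data and numbers below are an arbitrary illustration:

```python
import math
import random

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

sigma, n = 2.0, 25          # known sigma, toy sample size
random.seed(1)
x = [random.gauss(1.0, sigma) for _ in range(n)]
xbar = sum(x) / n
se = sigma / math.sqrt(n)
theta0 = 0.5                # hypothesized value of theta

# Flat-prior posterior: theta | x ~ N(xbar, se^2), so
# P(theta <= theta0 | x) is a normal CDF evaluation.
posterior = norm_cdf((theta0 - xbar) / se)

# Confidence distribution H(theta0) = P(Xbar >= xbar | theta = theta0),
# i.e. a one-sided p-value viewed as a function of theta0.
confidence = 1.0 - norm_cdf((xbar - theta0) / se)

print(abs(posterior - confidence) < 1e-12)  # True
```

The two numbers coincide, but they carry the two different interpretations discussed above: belief about theta versus frequency probability of the experiment under each hypothesized theta0.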