4

Frequentist inference is the only form of statistics taught in my department, and I feel like it has a strong hold over many students here. But when I read data science blogs, I get the feeling that frequentist methods (ANOVAs, t-tests, etc.) are really looked down upon, which is in stark contrast to the views of all the other graduate students around me.

I understand that one problem with frequentist methods is the strong assumptions they make (which vary depending on the specific method). In essence, this means you need clean data, and that is very rarely the case in real life.

Can anyone provide some real-life examples of where or why frequentist methods would fail? I'm looking for some strong arguments I could make against those in my department who hold such strong pro-frequentist beliefs.

Simon
    For some weaknesses of the frequentist approach, you may look for (many are available on CV) posts discussing differences between Bayesian and frequentist approaches – Christoph Hanck Aug 20 '15 at 06:27
  • are the Bayesian methods akin to methods/tools used in the data science/analysis fields in real world job markets? – Simon Aug 20 '15 at 06:37
    There definitely is quite some overlap. Many people argue, as you will see, that the Bayesian approach much more naturally allows to answer the questions we should really be interested in. That said, the Bayesian approach has very solid philosophical underpinnings, and the discussion thereof of course tends to be quite far away from applications. – Christoph Hanck Aug 20 '15 at 06:42
  • Frequentist methods fail in so many ways it's difficult to catalog. I've made a start at fharrell.com/post/journey . The biggest reason to go Bayes is to get direct evidence in favor of an assertion rather than indirect evidence against an assertion. Also, most frequentist procedures are approximate and we are unable to get exact p-values and confidence intervals. – Frank Harrell Nov 13 '21 at 14:35
    I find these frequent frequentist vs Bayesian discussions artificially created by academics to prop up research funding. Practitioners use whatever they know or works. As a junior member of the team it is sometimes practical to follow the path of least resistance and do what the group is doing, whether it's Bayesian or not is inconsequential. You'll get the same result in the end with least amount of wasted time on pointless discussions. In other cases it may help you follow an opposing view to the prevailing in the group to differentiate yourself. In any case the choice is not dictated by ... – Aksakal Nov 13 '21 at 16:41
  • the merit of approach, but by internal politics and your career building strategy. For instance, when I join a new team I don't insist on using my favorite tools, but do the work with what the team is already using. I will introduce new tools later, of course, to differentiate myself. However, that's just my approach, and you may do it differently. shaking a boat can work for you. It does for some, but it's a high risk strategy. It's more reliable to build your reputation points first by delivering tangible results, within whatever framework is imposed on you. They all work the same – Aksakal Nov 13 '21 at 16:45

3 Answers

5

It sounds like your objection is to the use of complete datasets, rather than an objection to the statistical methodology used on those datasets. While you are correct that datasets used in university courses are usually much cleaner than real-life datasets (in particular, they often do not have missing data), this is for pedagogical reasons: the goal of using clean datasets is to focus attention on the statistical methods under consideration in the course, without adding additional complications.

Classical statistical methods (which you refer to as "frequentist" here) are capable of dealing with missing data using imputation methods. These methods are reasonably complicated and they would tend to side-track the analysis if used heavily in introductory statistical courses. Nevertheless, they can, in principle, be added on to any of the classical statistical methods. There is something of an "opening wedge" here to argue against classical methods and in favour of Bayesian methods. Multiple imputation can itself be regarded as a kind of numerical version of Bayesian analysis, and one could reasonably mount the argument that allowing multiple imputation is essentially admitting the reasoning behind Bayesian statistics. This is a complicated argument, but it is one you could investigate.
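As a hedged sketch of that point, here is a toy multiple-imputation run under a normal model. All data and modelling choices below are invented for illustration; a real analysis would use a dedicated package and pool variances as well as point estimates under Rubin's rules.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a normal sample with 20 of 100 values missing.
x = rng.normal(loc=5.0, scale=2.0, size=100)
x[rng.choice(100, size=20, replace=False)] = np.nan

obs = x[~np.isnan(x)]
n_mis = int(np.isnan(x).sum())

# Multiple imputation: each imputed value is a *draw* from a predictive
# distribution fitted to the observed data, not a single plugged-in number.
# Drawing from a predictive distribution is where the Bayesian flavour enters.
M = 50
estimates = []
for _ in range(M):
    draws = rng.normal(obs.mean(), obs.std(ddof=1), size=n_mis)
    completed = np.concatenate([obs, draws])
    estimates.append(completed.mean())

# Pool the M completed-data point estimates by averaging (Rubin's rules).
pooled_mean = float(np.mean(estimates))
print(round(pooled_mean, 2))
```

The randomness of the draws is what propagates imputation uncertainty into the pooled result; replacing the draws with a single plugged-in mean would understate that uncertainty.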

As to whether or not classical statistics makes strong assumptions, that depends on what models you are talking about and what you compare them to. At one extreme you can use non-parametric models, which have very few assumptions, and at the other extreme you can use highly specific parametric models, which may involve strong assumptions. Classical statistics is a broad field and it has models at varying levels of generality and with varying levels of detail in the assumptions.

There are a few tricky "paradoxes" in probability and statistics, and some of them are difficult to deal with without using Bayesian methods. The "shooting-room paradox" is particularly difficult to deal with without some kind of Bayesian prior specification over a set of difficult conditioning events (see e.g., Bartha and Hitchcock 1999). Challenges to the classical paradigm have often come from Bayesian statisticians who were unsatisfied with the ability of that paradigm to deal with tricky paradoxes, so this is also something you could investigate.

Ben
  • The shooting room paradox, like the sleeping beauty paradox and many others, can be addressed in the frequentist paradigm by considering predictive p-values and frequentist prediction intervals. These scenarios seem paradoxical because for a single roll of the dice or flip of a coin we have multiple opportunities to make a type I error (multiple tests). We can consider proportions (p-values) and error rates over these repeated tests to conduct predictive inference in these scenarios. Let me know if I am mistaken. – Geoffrey Johnson Nov 13 '21 at 02:46
  • @GeoffreyJohnson: You may well be right, but I really can't tell without a reference to the actual argument. (It's not something I've seen before.) Can you give a reference to the frequentist treatment that solves these problems? – Ben Nov 13 '21 at 03:37
  • I tried my hand at The Sleeping Beauty Paradox [here](https://stats.stackexchange.com/questions/41208/the-sleeping-beauty-paradox/551422#551422). Let me know if you see room for improvement. – Geoffrey Johnson Nov 13 '21 at 03:39
2

As Ben mentioned there are often a variety of methods available to approach a problem (including missing data), each with a different set of assumptions (frequentist or not). I realize your question is asking about where frequentist methods will fail, but I will provide my rationale for why many adhere to frequentist methods even when presented with the opportunity to "go Bayesian." I'm not writing this to provoke anyone, I'm simply showing the thought process that some in your department may have so that you are prepared for a scientific discussion.

To the frequentist, population-level quantities (typically denoted by Greek characters) are fixed and unknown because we are unable to sample the entire population. If we could sample the entire population we would know the population-level quantity of interest. In practice we have a limited sample from the population, and the only thing one can objectively describe is the operating characteristics of an estimation and testing procedure. Understanding the long-run performance of the estimation and testing procedure is what gives the frequentist confidence in the conclusions drawn from a single experimental result. What the experimenter or anyone else subjectively believes before or after the experiment is irrelevant, since this belief is not evidence of anything. Beliefs and opinions are not facts. If the frequentist has historical data ("prior knowledge"), this can be incorporated in a meta-analysis through the likelihood and does not require the use of belief probabilities regarding parameters. If fixed population quantities are treated as random variables, this can introduce bias in estimation and inference.
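The meta-analytic route mentioned above can be sketched as fixed-effect, inverse-variance pooling of a historical and a current study. The effect estimates and standard errors below are invented for illustration.

```python
import numpy as np

# Hypothetical effect estimates and standard errors from a historical
# study and a current study (numbers invented for illustration).
estimates = np.array([1.8, 2.2])   # historical, current
ses = np.array([0.5, 0.4])

# Fixed-effect (inverse-variance) pooling: this combines the two
# likelihoods without placing any belief distribution on the parameter.
w = 1.0 / ses**2
pooled = float(np.sum(w * estimates) / np.sum(w))
pooled_se = float(np.sqrt(1.0 / np.sum(w)))

print(round(pooled, 3), round(pooled_se, 3))
```

The pooled estimate lands between the two studies, weighted toward the more precise one, and its standard error is smaller than either input: the historical data tightens the inference without any prior on the parameter.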

To the Bayesian, probability is axiomatic and measures the experimenter. The Bayesian interpretation of probability as a measure of belief is unfalsifiable $-$ it is not a verifiable statement about the actual parameter, the hypothesis, nor the experiment. It is a statement about the experimenter. Who can claim to know the experimenter's beliefs better than the experimenter? If the prior distribution is chosen in such a way that the posterior is dominated by the likelihood or is proportional to the likelihood, Bayesian belief is more objectively viewed as confidence based on frequency probability of the experiment.

A common example used to promote Bayesian statistics and discourage the use of p-values involves a screening test for cancer or COVID and a disease prevalence. Here is a LinkedIn article on the topic showing the internal contradictions of such an approach. Another common example used to promote Bayesian statistics and discourage the use of p-values involves incorporating "prior knowledge." As mentioned earlier, if performed objectively this "prior knowledge" is simply the likelihood from a historical study which can easily be incorporated through a frequentist meta-analysis. Other examples used to promote the Bayesian paradigm involve predictive inference. Such predictive inference is possible under the frequentist paradigm using predictive p-values and prediction intervals.

I find confidence curves to be a particularly useful way to visualize frequentist inference, analogous to Bayesian posterior distributions. Here are some threads that demonstrate this [1] [2].
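A confidence curve is essentially the two-sided p-value traced over hypothesized parameter values. Here is a minimal sketch for a normal mean, using simulated data and a normal approximation (the function names are my own):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(1)
x = rng.normal(10.0, 3.0, size=40)
xbar, se = x.mean(), x.std(ddof=1) / np.sqrt(len(x))

# Confidence curve: two-sided p-value for H0: mu = mu0, traced over mu0.
# It peaks at 1 at the point estimate and falls off on either side;
# the set {mu0 : p(mu0) >= 0.05} is the 95% confidence interval.
def p_value(mu0):
    z = (xbar - mu0) / se
    return 2.0 * (1.0 - norm_cdf(abs(z)))

grid = np.linspace(xbar - 3 * se, xbar + 3 * se, 7)
curve = [p_value(m) for m in grid]
print([round(p, 3) for p in curve])
```

Plotting `curve` against `grid` gives a tent-shaped curve that looks superficially like a posterior density but is read entirely in terms of frequency coverage.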

Geoffrey Johnson
  • This is an interesting answer but I wonder if it mightn't be more suited to another question (& linked to from this one). – Scortchi - Reinstate Monica Nov 13 '21 at 11:26
  • I don't understand the idea that the Bayesian experimenter's prior can never be disputed. For example, if Bob's prior is that the Monster Raving Loony Party will win the next general election with probability > 99%, then Bob's prior is obviously stupid. – fblundun Nov 13 '21 at 12:31
  • @Scortchi, I think it is relevant here because the OP indicated they were unclear why so many in their department feel strongly about frequentist inference. My answer is intended to shine light on this. If the OP is going to enter into a scientific discussion and argue against frequentism in favor of Bayesianism they should have a strong understanding of the counter arguments. – Geoffrey Johnson Nov 13 '21 at 14:10
  • @fblundun, 99% is Bob's opinion and you have yours. We could entertain other hypotheses but we would still be debating yours and Bob's opinion. In order to move the debate away from opinion we must discuss the operating characteristics of an experiment. This is all that we can objectively describe. Your prior can be seen as a collection of $\theta$'s you gave yourself, and the posterior is a reduced collection of $\theta$'s you gave yourself. How your collection relates to the unknown fixed true $\theta$ and why your collection is better than someone else's is difficult to defend. – Geoffrey Johnson Nov 13 '21 at 14:18
  • Are you really saying that you don't think the opinion "The MRLP probably won't win the next general election" is any better than the opinion "The MRLP probably will win the next general election"? – fblundun Nov 13 '21 at 16:42
  • I am saying they are both opinions and I can have my own opinion about these two opinions. However, none of this is evidence. If we have results from a poll based on a sample from the voting population we can objectively discuss the operating characteristics of this poll when using it to predict the results of a future election. – Geoffrey Johnson Nov 13 '21 at 17:22
  • If someone has strong opinions without direct evidence this can be seen as transfer learning. This is where a likelihood used to investigate a different parameter is transferred to the new problem and it is assumed that the unknown fixed true parameters are equal. – Geoffrey Johnson Nov 13 '21 at 17:27
  • So you actually don't believe that "The MRLP probably won't win the next general election"? – fblundun Nov 13 '21 at 19:07
0

This post is about 2,800 words long in order to handle the response to comments. It looks much larger due to the size of the graphics; about half the post's length is graphics. Nonetheless, a comment mentions that with my edit the whole is difficult to consume, so I am providing an outline and a restructuring to make it easier to know what to expect.

The first section is a brief defense of the use of Frequentist methods. All too often in these discussions people bash one tool in favor of another. The second is a description of a game where Bayesian methods guarantee that the user of Frequentist methods takes a loss. The third section explains why that happens.

A DEFENSE OF FREQUENTISM

The statistical methods originated by Pearson and Neyman are optimal methods. Fisher's method of maximum likelihood is an optimal method. Likewise, Bayesian methods are optimal methods. So, given that they are all optimal in at least some circumstances, why prefer non-Bayesian methods to Bayesian ones?

First, if the assumptions are met, the sampling distribution is a real thing. If the null is true, the assumptions hold, the model is the correct model and if you could do things such as infinite repetition, then the sampling distribution is exactly the real distribution that nature would create. It would be a direct one-to-one mapping of the model to nature. Of course, you may have to wait an infinite amount of time to see it.

Second, non-Bayesian methods are often required by statute or regulation. Some accounting standards are only sensible with a non-Bayesian method. Although there are workarounds in the Bayesian world for handling a sharp null hypothesis, the only type of inferential method that can properly handle a hypothesis such as $H_0:\theta=k$ with $H_A:\theta\ne{k}$ is a non-Bayesian method. Additionally, non-Bayesian methods can have highly desired properties that are unavailable to the Bayesian user.

Frequentist methods provide a guaranteed maximum level of false positives. Simplifying that statement, they give you a guarantee of how often you will look like a fool. They also permit you to control against false negatives.
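That guarantee on false positives can be checked directly by simulation: under a true null, a level-0.05 test should reject about 5% of the time, no more. A one-sample z-test with known variance is assumed here for simplicity.

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(42)
alpha, n, n_sims = 0.05, 30, 20000
rejections = 0

# Simulate many experiments in which the null H0: mu = 0 is TRUE,
# and count how often a two-sided z-test (known sigma = 1) rejects.
for _ in range(n_sims):
    x = rng.normal(0.0, 1.0, size=n)
    z = x.mean() / (1.0 / sqrt(n))
    p = 2.0 * (1.0 - norm_cdf(abs(z)))
    if p < alpha:
        rejections += 1

rate = rejections / n_sims
print(rate)  # should hover near alpha = 0.05
```

The observed rejection rate hovers near the nominal 5% no matter what the true parameter is elsewhere in the space, which is exactly the "how often you will look like a fool" guarantee.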

Additionally, Frequentist methods, generally, minimize the maximum amount of risk that you are facing. If you lack a true prior distribution, that is a wonderful thing.

As well, if you need to do transformations of your data for some reason, the method of maximum likelihood has wonderful invariance properties that usually are absent from a Bayesian method.

Problematically, Bayesian calculations are always person-specific. If I have a highly bigoted prior distribution, it can be the case that the data collected is too small to move it regardless of the true value. Frequentist methods work equally well, regardless of where the parameter sits. Bayesian calculations do not work equally well over the parameter space. They are best when the prior is a good prior and worst when the prior is far away.

Finally, Bayesian reasoning is always incomplete. It is inductive. For example, models built before relativity would always be wrong about things that relativity impacts. A Frequentist test of Newtonian models would have rejected the null in edge cases such as the orbit of Mercury. That is complete reasoning: Newton is at least sometimes wrong. It is true you still lack a good model, but you know the old one is bad. Bayesian methods would rank models, and the best model would be a bad model. Their reasoning is incomplete and one cannot know how it is wrong.

Now let us talk about when Bayesian methods are better than Frequentist methods. There are three places where that happens, setting aside cases where a non-Bayesian method is required by some rule such as an accounting standard.

The first is when you are needing to update your beliefs or your organization's beliefs. Bayesian methods separate Bayesian inference from Bayesian actions. You can infer something and also do nothing about it. Sometimes we do not need to share an understanding of the world by agreeing on accepting a convention like a t-test. Sometimes I need to update what I think is happening.

The second is when real prior information exists, but not in a form that would allow something like a meta-analysis. For example, people investing in riskier assets than bonds should anticipate receiving a higher rate of return than bonds. If you know the nominal interest rate on a bond of long enough duration, then you should anticipate that actors in the market are attempting to earn more. Your prior should reflect that it is improbable that the center of location for stocks is less than the return on bonds. Conversely, it is very probable that it is greater, but not monumentally greater either. It would be surprising for a firm in a competitive market to be discounted to a 200% per year return.

The third reason is gambling. That is sort of my area of expertise. My area can be thought of as being one of two things. The first is the study of the price people require to defer consumption. The second would be the return required to cover a risk.

In the first version, buying a two-year-old a birthday present in order to see them smile next week is an example of that. It is a gamble. They may fall in love with the box and ignore the toy, breaking our hearts while making them happy. In the second, we consider not only the raw outcome but the price of risk. In a competition to own or rid oneself of risk, prices form.

In a competitive circumstance (the second case, not the first), only Bayesian methods will work, because non-Bayesian methods, and some Bayesian methods, are incoherent. A set of probabilities is incoherent if I can force a middleman, such as a market maker or bookie, to take a loss.

All Frequentist methods, at least some of the time, when used with a gamble can cause a bookie or market maker to take a loss. In some cases, the loss is total. The bookie will lose at every point in the sample space.

I have a set of a half-dozen exercises that I do for this and I will use one below. Even though the field of applied finance is Frequentist, it should not be. See the third section for the reason.

THE EXAMPLE

As you are a graduate statistics student, I will drop the story I usually tell around the example so that you can just do the math. In fact, this one is very simple. You can readily do this yourself.

Choose a rectangle in the first quadrant of a Cartesian plane such that no part of the rectangle touches either axis. For the purposes of making the problem computationally tractable, give yourself at least some distance from both axes and do not make it insanely large. You can create significant digit issues for yourself.

I usually use a rectangle where neither $x$ nor $y$ is less than 10 and nothing is greater than 100, although that choice is arbitrary.

Uniformly draw a single coordinate pair from that plane. All the actors know where the rectangle is, so you have a proper prior distribution with no surprises. This condition exists partly to ground the prior, but also because there exist cases where improper priors give rise to incoherent prices. As the point is only to show that differences exist, and not to go extensively into prior distributions, a simple grounding is used.

The region doesn't have to be a rectangle. If you hate yourself, make a region shaped like an outline of the Statue of Liberty. Choose a bizarre distribution over it if you feel like it. It might be unfair to Frequentist methods to choose a shape that is relatively narrow, particularly one with something like a donut hole in it.

On that rectangle will be placed a unit circle. There is nothing special about a circle, but unless you hate yourself, make it a circle that is small relative to the rectangle. If you make it too small, again, you could end up with significant digit issues.

You will be the bookie and I will be the player. You will use Frequentist methods and I will use Bayesian methods. I will pay an upfront lump sum fee to you to play the game. The reason is that a lump sum is a constant and will fall out of any calculations about profit maximization. Again, if you hate yourself, do something else.

You agree to accept any finite bet that I make, either short or long, at your stated prices. You also agree to use the risk-neutral measure. In other words, you will state fair Frequentist odds. Your sole source of profit is your fee, in expectation. We can assume that you have nearly limitless pocket depth compared to my meager purse.

The purpose of this illustration is to show how a violation of the Dutch Book Theorem, or of its converse, assures bad outcomes. You can arbitrage any Frequentist pricing game, though not necessarily in this manner.

The unit circle is in a position unknown to either of us. The unit circle will emit forty points drawn uniformly over the circle. You will draw a line from the origin at $(0,0)$ through the minimum variance unbiased estimator of the center of the circle. The line is infinitely long so you will cut the disk into two pieces.

We will gamble on whether the left side or the right side is bigger. Because the MVUE is guaranteed to be unbiased by force of math, you will offer one-to-one odds for either the left side or the right side. How will I win?

As an aside, it doesn't matter if you convert this to polar coordinates or run a regression forcing the intercept through the origin. The same outcome ends up happening.

So first understand what a good and a bad cut would look like. [image: a good cut versus a bad cut]

In this case, the good cut is perfect. Every other cut is in some sense bad.

Of course, neither you nor I get to see the outline. We only get to see the points. [image: the sampled points]

The Frequentist line passes through the MVUE. Since the distribution of errors is symmetric over the sample space, one-to-one odds should not make you nervous. [image: the Frequentist line through the MVUE]

It should make you nervous, though. With the Frequentist method, all of the information about the location of the disk comes from the data alone. That implies that the Bayesian has access to at least a trivial amount of extra information. So I should win at least on those rare occasions where the circle is very near to the edge and most of the points are outside the rectangle. So I have at least a very small advantage, though you can mitigate it by making the rectangle comparatively large.

That isn't the big issue here. To understand the big issue, draw a unit circle around the MVUE. [image: a unit circle drawn around the MVUE]

You now know, for sure, that the upper-left side is smaller than the opposite side, with perfect certainty. You can know this because some points are outside the implied circle. If I, the Bayesian, can take advantage of that, then I can win anytime the MVUE sits in an impossible place. Any Frequentist statistic can do that. Most commonly, it happens when the left end of a confidence interval sits in an impossible location, as may happen when it is negative for values that can only be positive.
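The impossibility claim can be checked with a rough simulation, using the sample mean of the points as a stand-in for the answer's MVUE of the disk's center. The rectangle, sample size, round count, and the function name `play_round` are my own choices.

```python
import numpy as np

rng = np.random.default_rng(7)

def play_round(n_points=40):
    # Unknown center, drawn uniformly from the rectangle [10, 100]^2.
    center = rng.uniform(10.0, 100.0, size=2)
    # n_points drawn uniformly over the unit disk around the center.
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n_points)
    r = np.sqrt(rng.uniform(0.0, 1.0, size=n_points))
    pts = center + np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    # The estimator: the sample mean of the observed points.
    est = pts.mean(axis=0)
    # The estimate is logically impossible whenever some observed point
    # lies more than one unit away from it -- no unit disk centered at
    # `est` could have produced that point.
    dists = np.linalg.norm(pts - est, axis=1)
    return bool(dists.max() > 1.0)

impossible = sum(play_round() for _ in range(2000)) / 2000
print(impossible)  # fraction of rounds with a logically impossible estimate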

The Bayesian posterior is always within one unit of every single point in the data set. It is the grand intersection of all the putative possible circles drawn around every point. The green line is the approximation of the posterior, though I think the width of the green line might be a bit distorting. The black dot is the posterior mean and the red dot is the MVUE. [image: the approximate posterior]

The black circle is the Bayesian circle and the red circle the Frequentist one. In the iteration of this game that was used to make this example, I was guaranteed a win 48% of the time and won roughly 75% of the remaining time from the improved precision. If you make Kelly Bets for thirty rounds, you make about 128,000 times your initial pot.
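Taking the quoted 48% and 75% figures at face value, the Kelly arithmetic for an even-money bet can be sketched as follows. The 30-round multiple computed here is the typical (log-growth) figure under these assumptions, not necessarily the exact number above.

```python
import numpy as np

# Combined win probability implied by the quoted figures: a guaranteed
# win 48% of the time, and a 75% win rate in the remaining 52% of rounds.
p_win = 0.48 + 0.52 * 0.75

# Kelly fraction for an even-money (one-to-one) bet: f* = 2p - 1.
f = 2.0 * p_win - 1.0

# Expected log-growth per round, and the typical wealth multiple
# after 30 rounds of betting the Kelly fraction each time.
g = p_win * np.log(1.0 + f) + (1.0 - p_win) * np.log(1.0 - f)
multiple_30 = float(np.exp(30.0 * g))

print(round(f, 2), round(multiple_30))
```

Even modest per-round edges compound dramatically over thirty rounds, which is why incoherent bookie odds are so costly.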

Under Frequentist math, you expected to see this distribution of wins over thirty rounds. [image: the binomial win distribution implied by Frequentist odds]

The Bayesian player expects to win under this distribution. [image: the Bayesian player's expected win distribution]

Technical Aside

It is not sufficient for the MVUE to be outside the posterior. It must also be outside the marginal posterior distribution of the slope. There do exist circumstances where the Frequentist line is possible even though the Frequentist point is not. Imagine the Bayesian posterior as a cone from the origin. The MVUE can be outside the posterior but inside the cone. In that circumstance, you bet the Kelly Bet based on the better precision of the Bayesian method. Improper priors can also lead to incoherence.

Note On Images

The boxes in the images were there to make the overall graphic look nice. They weren't the boundary I actually used.

WHY THIS HAPPENS

I have a half-dozen of these examples related to market trading rules. Games like this are not that difficult to create once you notice that they exist. A cornucopia of real-world examples exists in finance. I would also like to thank you for asking the question because I have never been able to use the word cornucopia in a sentence before.

A commentator felt that the difference was due to allowing a higher level of information by removing the restriction that an estimator be unbiased. That is not the reason. I have a similar game that uses the maximum likelihood estimator and it generates the same type of result. I also have a game where the Bayesian estimator is a higher-variance unbiased estimator, and it also leads to guaranteed wins. The minimum variance unbiased estimator is precisely what it says it is. That does not imply that it is also coherent.

Non-Bayesian statistics, and some Bayesian statistics, are incoherent. If you place a gamble on them, then perfect arbitrage can be created at least some of the time. The reason is a bit obscure, unfortunately, and goes to foundations. The base issue has to do with the partitioning of sets in probability theory. Under Kolmogorov’s third axiom, where $E_i$ is an event in a countable sequence of disjoint sets $$\Pr\left(\cup_{i=1}^\infty{E_i}\right)=\sum_{i=1}^\infty\Pr(E_i),$$ we have at least a potential conflict with the Dutch Book Theorem. The third result of the Dutch Book Theorem is $$\Pr\left(\cup_{i=1}^N{E_i}\right)=\sum_{i=1}^N\Pr(E_i),N\in\mathbb{N}.$$ It turns out that there is a conflict.

If you need to gamble, then you need sets that are finitely but not countably additive. Furthermore, in most cases, you also need to use proper prior distributions. Any Frequentist pricing where there is a knowledgeable competitor leads to arbitrage positions. It can take quite a while to figure out where the arbitrage is, but with effort it can be found. That includes asset allocation models when $P=P(Q)$. Getting $Q$ wrong shifts the supply or demand curves and so gets $P$ wrong.

There is absolutely nothing wrong with the unbiased estimators in the example above. Unbiased estimators do throw away information, but they do so in a principled and intelligently designed manner. The fact that they produce impossible results in this example is a side-effect anyone using them should be indifferent to. They are designed so that all information comes from the data alone. That is the goal. It is unfair to compare them to a Bayesian estimator if your goal is to have all information come from the data. The goal here isn’t scientific; it is gambling. It is only about putting money at risk.

The estimator is only bad in the scientific sense because we have access to information from outside the data that the method cannot use. What if we were wrong, and the Earth really is round, angels do not push the planets around, and spirits do not come to us in our dreams? Sometimes not using outside knowledge protects science. In gambling, that is a bad idea. A horse that is a bad mudder is important information to include if it just rained, even if that is not in your data set.

This example is primarily to show that it can be done and that it produces uniquely differing results. Real prices often sit outside the dense region of the Bayesian estimation. Warren Buffett and Charlie Munger have been able to tell you that since before I was born, as were Graham and Dodd before them. They just were not approaching it in the framework of formal probability. I am.

It is the interpretation of probability that is the problem, not the bias or lack thereof. Always choose your method for fitness of purpose, not popularity. Our job is to do a good job, not be fashionable.

Dave Harris
  • Why use the MVUE? With frequentist statistics one can just as well add bias to improve the overall result. In the end it boils down to the bias improving the results. That's a simple win if you apply it to one method and not to the other. It is also sort of cheating when your bias is based on perfect information. Biased methods can reduce the variance of the error in the estimate (or reduce other performance measures), but the bias will only work better when the bias is 'correct'. Often some form of cross-validation is applied, to test the bias, and this makes it frequentist again. – Sextus Empiricus Nov 11 '21 at 08:38
  • @SextusEmpiricus I created an edit for you. – Dave Harris Nov 13 '21 at 01:57
  • Dave have you ever seen a sampling distribution? Why do you say it's real? – Frank Harrell Nov 13 '21 at 14:33
  • You need to structure your answer so that it is consumable at its current length – Aksakal Nov 13 '21 at 14:39
  • @Aksakal I will print it out and try and work on it at lunches over the next few weeks. I am heavily involved in my area's COVID response. We have reached the point where ambulances arrive at homes, assess and leave without patients if they do not fit criteria. I am thankful I am not a pulmonologist, respiratory therapist or pediatrician. The edit does make it a bit higgledy-piggledy. – Dave Harris Nov 13 '21 at 18:21
  • @FrankHarrell give me an infinite amount of time, a perfect model, and no heat death for the universe, and I can show it to you. Well, except for two-dimensional Brownian motion because there will be holes in the distribution once we have made it to infinity. – Dave Harris Nov 13 '21 at 18:23
  • @Aksakal I couldn't sleep so I edited it. – Dave Harris Nov 14 '21 at 05:09