
I have often heard that in certain instances it can be more beneficial to use Bayesian methods because they provide "a distribution of possible answers" (i.e., the posterior distribution) instead of a single answer (as in the frequentist case). However, it seems that at the end of the day, the analyst is still required to transform this "distribution of possible answers" into a single answer.

For example: if a Bayesian model is used to estimate the posterior distribution of $\mu$, the analyst is still required to take either the MAP or the expectation of this distribution to return a final answer.

Is this the main benefit of Bayesian models? That is, if the priors are correctly specified, are the credible intervals associated with the posterior distribution (of the parameter of interest) more reliable?

stats_noob
  • Why is the analyst required to take either the MAP or the expectation of the posterior distribution? It all depends on the question being asked: for example, a credible interval may be what is wanted, or something that feeds into a later piece of analysis using the full posterior distribution. – Henry Oct 12 '21 at 10:30
  • A possible advantage of a credible interval is that it may be closer to the intuitive interpretation of what such an interval represents than a frequentist confidence interval is. – Henry Oct 12 '21 at 10:30
  • As I mention in [this recent answer](https://stats.stackexchange.com/questions/547367/advantages-of-bayesian-methods-for-parameter-estimation/547368#547368), given a posterior distribution and a loss (or utility) function, you can derive an optimal point estimate / an optimal decision. This is a benefit, because without a distribution you cannot do that. – Richard Hardy Oct 13 '21 at 14:39
  • Regarding the need for a point versus a distribution as "the final answer", see [this thread](https://stats.stackexchange.com/questions/351897). – Richard Hardy Oct 13 '21 at 14:55

4 Answers


First of all, frequentist methods also provide a distribution over possible answers; we just do not call them distributions, for a philosophical reason. Frequentists consider the parameters of a distribution to be fixed quantities. A parameter is not allowed to be random, so you cannot talk about distributions over parameters in a meaningful way. In frequentist methods we estimate confidence intervals, which can be thought of as distributions if we let go of the philosophical details. In Bayesian methods, by contrast, the parameters are allowed to be random, so we talk about (prior and posterior) distributions over the parameters.

Second, it is not always the case that only a single value is used at the end. Many applications require the entire posterior distribution in subsequent analysis. In fact, deriving a suitable point estimate requires the full distribution. A well-known example is risk minimization. Another example is model identification in the natural sciences in the presence of significant uncertainties.

Third, Bayesian inference has many benefits over a frequentist analysis (not just the one that you mention):

  1. Ease of interpretation: It is hard to understand what a confidence interval is and why it is not a probability distribution. The reason is simply a philosophical one, as I have briefly explained above. The probability distributions in Bayesian inference are easier to understand because that is how we typically tend to think in uncertain situations.

  2. Ease of implementation: It is easier to obtain Bayesian probability distributions than frequentist confidence intervals. Frequentist analysis requires us to identify a sampling distribution, which is very difficult in many real-world applications; a minimal sketch after this list illustrates how directly a Bayesian posterior can be computed.

  3. Assumptions of the model are explicit in Bayesian inference: For example, many frequentist analyses assume asymptotic normality for computing the confidence interval, but no such assumption is required for Bayesian inference. Moreover, the assumptions made in Bayesian inference are more explicit.

  4. Prior information: Most importantly, Bayesian inference allows us to incorporate prior knowledge into the analysis in a relatively simple manner. In frequentist methods, regularization is used to incorporate prior information, which is very difficult to do in many problems. This is not to say that incorporating prior information is easy in Bayesian analysis, but it is easier than in frequentist analysis.
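
To make the ease-of-implementation point concrete (item 2 above), here is a minimal sketch in R of obtaining a full posterior by grid approximation; the binomial data (9 successes in 30 trials) and the Beta(2, 4) prior are illustrative choices, not part of the original argument.

theta <- seq(0.001, 0.999, by = 0.001)               # grid over the parameter
unnorm <- dbeta(theta, 2, 4) * dbinom(9, 30, theta)  # prior times likelihood
post <- unnorm / sum(unnorm)                         # normalized posterior on the grid
sum(post[theta < 0.5])                               # e.g., posterior P(theta < 0.5)

Every posterior summary (intervals, tail probabilities, moments) can then be read directly off the grid, with no sampling distribution required.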

Edit: A particularly good example of the ease of interpretation of Bayesian methods is their use in probabilistic machine learning (ML). Several methods in the ML literature were developed against the backdrop of Bayesian ideas, for example relevance vector machines (RVMs) and Gaussian processes (GPs).

As Richard Hardy pointed out, this answer gives the reasons why someone would want to use Bayesian analysis. There are good reasons to use frequentist analysis too. In general, frequentist methods are computationally more efficient. I would suggest reading the first 3-4 chapters of 'Statistical Decision Theory and Bayesian Analysis' by James Berger, which gives a balanced view on this issue, but with an emphasis on Bayesian practice.

To elaborate on the use of the entire distribution rather than a point estimate in risk minimization, a simple example follows. Suppose you have to choose a value of a process parameter to make a decision, and the cost of choosing a wrong value is $L(\hat{\theta},\theta),$ where $\hat{\theta}$ is the chosen estimate and $\theta$ is the true parameter. Given the posterior distribution $p(\theta|D)$ (where $D$ denotes the observations), we can compute the expected loss $\int L(\hat{\theta},\theta)\,p(\theta|D)\,d\theta$ for every candidate value $\hat{\theta},$ and the $\hat{\theta}$ with the minimum expected loss can be used for decision making. This results in a point estimate, but the value of the point estimate depends on the loss function.
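
As an illustration, here is a minimal sketch in R of this minimization using posterior draws; the Beta posterior and the asymmetric loss (overestimation penalized twice as heavily as underestimation) are hypothetical choices for the example.

set.seed(1)
theta <- rbeta(1e5, 11, 25)                    # draws from a posterior p(theta | D)
loss <- function(est, th) ifelse(est > th, 2 * (est - th), th - est)
candidates <- seq(0.05, 0.60, by = 0.001)      # candidate point estimates
exp_loss <- sapply(candidates, function(est) mean(loss(est, theta)))
candidates[which.min(exp_loss)]                # Bayes estimate under this loss

Under this loss the minimizer is the 1/3 posterior quantile rather than the posterior mean; change the loss function and the optimal point changes with it.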

Based on a comment by Alexis, here is why frequentist confidence intervals are harder to interpret. A confidence interval is (as Alexis has pointed out) a plausible range of estimates for a parameter given a Type I error rate. One naturally asks where this plausible range comes from. The frequentist answer is that it comes from the sampling distribution. But then the question is: we only observe one sample, so where do the other samples come from? The frequentist answer is that we infer what other samples could have been observed based on the likelihood function. But if we are inferring other samples based on the likelihood function, those samples should have a probability distribution over them, and consequently the confidence interval should be interpreted as a probability distribution. For the philosophical reason mentioned above, however, this last extension from probability distribution to confidence interval is not allowed. Compare this to the Bayesian statement: a 95% credible region means that the true parameter lies in this region with 95% probability.
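
The 'long run' reading can be made concrete by simulation. Here is a minimal sketch in R, with an illustrative normal model and the standard 95% t-interval: about 95% of the intervals computed over repeated samples cover the fixed true mean, but no probability statement attaches to any single computed interval.

set.seed(7)
true_mu <- 10
covered <- replicate(10000, {
  x <- rnorm(20, mean = true_mu, sd = 3)    # one hypothetical sample
  ci <- t.test(x)$conf.int                  # its 95% confidence interval
  ci[1] < true_mu && true_mu < ci[2]        # does it cover the true mean?
})
mean(covered)                               # close to 0.95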

A side note on the philosophical differences between Bayesian and frequentist theory (based on a comment by Stian Yttervik): In frequentist theory, the probability of an event is the relative frequency of that event over a large number of repeated trials of the experiment in question. Therefore, the parameters of a distribution are fixed, because they stay the same in all repetitions of the experiment. In Bayesian theory, probabilities are degrees of belief that an event will occur in a single trial of the experiment in question. The problem with the frequentist definition of probability is that it cannot be used to define probabilities in many real-world applications. As an example, try to define the probability that I am typing this answer on an Android smartphone. A frequentist would say that the probability is either $0$ or $1$, while the Bayesian definition allows you to choose an appropriate number between $0$ and $1$.

Abhinav Gupta
  • "**It is hard to understand what a confidence interval is**" A plausible range of estimates for a parameter given a Type I error rate. – Alexis Oct 13 '21 at 04:30
  • Interesting also, but perhaps a side note: the philosophical discussion of what "probability" means is significantly different between the approaches, and is much of the reason we end up with the two ways of calculating the estimate. – Stian Yttervik Oct 13 '21 at 10:17
  • You are making some good points, but... Re 2: I think ease of implementation can go both ways and depends on the problem. It is often easier to just run an OLS regression and obtain confidence intervals than to specify priors and likelihoods required in Bayesian analysis. Re 3: asympt. normality is derived, not assumed. We assume some conditions (such as i.i.d. observations) and then asymptotic normality follows from the CLT. Re 4: I think regularization is often easier to do than specifying the priors. All in all, I do not think your answer gives a balanced view. – Richard Hardy Oct 13 '21 at 14:53
  • Regarding the need for a point versus a distribution as "the final answer", could you elaborate a bit on what you mean by the well known example of risk minimization? – Richard Hardy Oct 13 '21 at 14:56
  • @RichardHardy It is not a balanced view at all. The question is why to use Bayesian estimates; therefore, I just focused on that. I agree that there are many good reasons to use frequentist analysis. OLS is an example. The problem is: what if the data are significantly non-Gaussian? For example, if they are Laplacian, using OLS would be a mistake. In this case, frequentist analysis becomes harder. As per your comments, I will elaborate on risk minimization. Also, I will add your example of OLS to the answer. – Abhinav Gupta Oct 13 '21 at 16:14
  • @StianYttervik Yes, philosophical differences are at the heart of the differences between Bayesian and frequentist analysis. Probability means different things to different people. Bayesian thinking is more intuitive and more accessible, I guess. – Abhinav Gupta Oct 13 '21 at 16:17
  • @RichardHardy Asymptotic normality of likelihood functions can be derived only under certain conditions. The problem is that these conditions are often opaque to users. – Abhinav Gupta Oct 13 '21 at 16:30
  • By *not a balanced view* I meant something else. I meant that your comparisons stated as they are rather generally are debatable and possibly misleading. They only apply to some cases while opposite statements apply to other cases. Thus you are cherry picking the cases to support your claims; this is what I would call *unbalanced*. Regarding assumptions, I do not think frequentist assumptions are more opaque than Bayesian or vice versa. Properties of Bayesian estimators cannot be derived without assumptions, just like properties of frequentist estimators cannot be derived without assumptions. – Richard Hardy Oct 13 '21 at 16:36
  • Using OLS with Laplacian errors would not be a mistake because OLS does not make distributional assumptions (distributional assumptions are only needed for small sample inference but not for a number of other uses of the estimator); OLS is still BLUE. Meanwhile, assuming a Gaussian likelihood when the truth is Laplacian would invalidate Bayesian estimation, would it not? This is a good example where a frequentist method makes fewer assumptions and is more robust than a Bayesian method. – Richard Hardy Oct 13 '21 at 16:51
  • @RichardHardy Oops, I had something else in mind while thinking about OLS. Yes, you are right: OLS can be used with Laplacian errors too, but to find confidence intervals you will need to be careful. When the truth is Laplacian, why would I use Gaussian assumptions? I would just use a Laplacian likelihood. This is where Bayesian methods are explicit about assumptions. Also, when one knows that the errors are Laplacian, it is better to minimize absolute errors than squared errors. – Abhinav Gupta Oct 13 '21 at 17:33
  • That's true. My point was that OLS was robust to violation of a distribution assumption (or more precisely, it did not make one in the first place) while a standard Bayesian estimator was not. And we know that assumption violations are ubiquitous in practice. – Richard Hardy Oct 13 '21 at 17:34
  • @Alexis I think it's worth pointing out that situations can arise where it is not plausible that the parameter lies in the confidence interval. – fblundun Oct 13 '21 at 18:43
  • @fblundun That's right. I quite agree. The assumptions that go with any statistical method should not be ignored. There are also situations in which the prior informing a credible interval is both deeply unrealistic and overwhelms the data: this doesn't mean one shouldn't use Bayesian credible intervals. :) – Alexis Oct 13 '21 at 23:44
  • @Alexis That definition is harder to understand than that of a credible region. I have expanded on that in my answer. – Abhinav Gupta Oct 14 '21 at 05:40

I can't give a de jure benefit of Bayesianism, but I can offer some examples of how I find Bayesianism beneficial as compared to frequentism.

That the result of a Bayesian analysis is a posterior distribution and not a point estimate allows the analyst to perform some very straightforward calculations for decision analysis. As I explain here, the posterior can be used to estimate the expected loss of any decision (assuming a cost function is specified) simply by taking averages over samples obtained via MCMC techniques. This assumes one has a good model readily available (perhaps a benefit, perhaps a detriment depending on where you stand), but I can't overstate just how simple the calculations can be.
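
For instance, here is a minimal sketch in R of that calculation; the Beta posterior, the two actions, and their costs are hypothetical stand-ins for whatever a real MCMC run and cost function would supply.

set.seed(3)
theta <- rbeta(1e4, 11, 25)                      # stand-in for MCMC posterior draws
loss_act  <- rep(1, length(theta))               # acting always costs 1
loss_wait <- ifelse(theta > 0.35, 10, 0)         # waiting costs 10 if theta > 0.35
c(act = mean(loss_act), wait = mean(loss_wait))  # choose the smaller expected loss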

From some of the points you make in your post, it sounds like you're caught up on the fact that people still want a single number (the expectation of the posterior, for example, or the MAP of the parameters). But any such summary implies a particular cost function (the posterior mean, for example, minimizes expected squared error). If you want some other cost structure, then with Bayesianism you're free to use an estimator which caters to your needs, as I do in the link above.
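
As a two-line sketch in R of that freedom (the Gamma draws are a hypothetical stand-in for MCMC output), the same posterior sample yields different point estimates under different losses:

set.seed(42)
draws <- rgamma(1e5, shape = 3, rate = 2)   # stand-in for MCMC posterior draws
mean(draws)                                 # optimal under squared-error loss
median(draws)                               # optimal under absolute-error loss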

Demetri Pananos
  • (+1) OP may be interested in reading on [decision analysis](https://en.wikipedia.org/wiki/Decision_analysis) - often the main point of paying people to perform statistical analysis is because an organisation wants to optimise their decision-making based on that data. Plugging the results from a Bayesian analysis into your decision analysis procedure (often [Monte Carlo based](https://web.archive.org/web/20210310120627/https://www.wrike.com/project-management-guide/faq/what-is-monte-carlo-analysis-in-project-management/)) is common in healthcare, business etc & as you say, works very naturally. – Silverfish Oct 13 '21 at 14:49

There isn't an answer to your question.

It is true that there are circumstances where a Bayesian solution is intrinsically preferable to a Frequentist solution. The reverse is also true.

The main benefit of a Bayesian model is that it updates and improves your beliefs about the world. Other than that, the two systems are not comparable. They solve different questions.

A posterior distribution, if you really are using your real priors, should become your new prior. It becomes your default understanding of the world with respect to the parameters and future data.

If you are using it in an analysis for a third party, then it should be specified by their prior distributions and not yours. You are not updating your own beliefs.

All three main methods of constructing estimators (the method of maximum likelihood, frequency-based estimators such as the minimum variance unbiased estimator, and Bayesian estimators) yield optimal estimators; they are simply optimal under different criteria.

If you woke up one morning needing a point estimate, and assuming that one or more categories of estimation were not foreclosed by the nature of the problem, then the solution is to decide what you mean when you say a point is optimal.

You should answer a variety of questions. Who needs the point? Why do they need it? What happens to that third party if you choose the wrong point? Does it actually need to be a point? Could an interval or a distribution do as good a job, or a better one?

I think there is another difference that you might be missing. A frequentist interval or point works in the sample space. The distributions that are implicit or explicit in the process, such as the sampling distribution of Student's t statistic, are not distributions of beliefs. For statistics, they are the long-run distributions that you would expect to see while collecting samples over the sample space. They represent possible, but perhaps forever unrealized, outcomes that could happen.
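
That long-run picture can be simulated directly. Here is a minimal sketch in R (the normal data and sample size are illustrative): repeated sampling recovers the sampling distribution of the t statistic, a distribution over the sample space rather than over beliefs.

set.seed(5)
tstats <- replicate(10000, {
  x <- rnorm(15, mean = 2, sd = 1)          # one hypothetical sample
  (mean(x) - 2) / (sd(x) / sqrt(15))        # its t statistic
})
quantile(tstats, c(0.025, 0.975))           # close to qt(c(.025, .975), df = 14)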

The Bayesian prior and posterior distributions are distributions of beliefs about parameters. They are not distributions that can happen. They happen only in the mind. Change your priors and you change your posterior. Even the Bayesian predictive distribution, which also intrinsically minimizes the K-L divergence between the prediction and nature, can never happen. It is just the weighted sum of the possible distributions that could happen over the posterior or prior.

The posterior is the Bayesian conclusion. Getting a point requires adding additional criteria that are then imposed on the posterior, prior, or predictive distributions.

There are many good reasons to use a Bayesian solution. In some cases, it is the only permissible solution. The very same thing can be said about non-Bayesian tools too.

If you look at the opportunity costs of your model, what school of estimation should you use?

Dave Harris

Suppose your prior distribution is that a coin may be biased so it has success probability $\theta$ near $1/3.$ Specifically, you consider that $\theta\sim\mathsf{Beta}(2,4),$ so that $E(\theta)=1/3,$ $P(\theta < 1/2) = 0.8125,$ $P(0.0527 <\theta < 0.716) = 0.95,$ and $\theta$ has density function $f(\theta) = K\theta^{2-1}(1-\theta)^{4-1},$ where $K$ is the norming constant. [Computations in R.]

pbeta(.5, 2, 4)              # prior P(theta < 1/2)
[1] 0.8125
qbeta(c(.025,.975), 2, 4)    # prior 95% probability interval
[1] 0.05274495 0.71641794

Then you are allowed to toss the coin $n = 30$ times, obtaining $x = 9$ heads. Thus, your likelihood function is $g(x|\theta) \propto \theta^9(1-\theta)^{21},$ where the symbol $\propto$ indicates that the norming constant has been omitted.

Finally, by Bayes' Theorem, the posterior distribution is

$$g(\theta|x) \propto f(\theta)\times g(x|\theta)\\ \propto \theta^{2-1}(1-\theta)^{4-1} \times \theta^9(1-\theta)^{21}\\ \propto \theta^{11-1}(1-\theta)^{25-1},$$

where we recognize the last line as the kernel (density without norming constant) of $\mathsf{Beta}(11,25).$

Thus the posterior mean of $\theta$ is $E(\theta|x)=11/36= 0.3056,$ slightly smaller than the prior mean $E(\theta) = 0.333$ because the observed proportion of heads, $9/30 = 0.3,$ pulls the estimate down.

Also, a 95% posterior credible interval for $\theta$ is $(0.169, 0.463).$

qbeta(c(.025,.975), 11, 25)    # posterior 95% credible interval
[1] 0.1685172 0.4630446

The posterior distribution has information from your initial hunch that the coin may be biased towards Tails and also the results of your thirty-toss experiment with the coin.

The interpretation of this Bayesian interval estimate differs from the 'long run' interpretation of a frequentist confidence interval. Because the prior and likelihood both describe the particular coin at hand, we can say that the interval $(0.169, 0.463)$ also applies to the coin at hand. In particular we are pretty sure the coin is not fair.
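
To make "pretty sure" concrete (an added check, in the same R style as above), the posterior probability that the coin is biased towards Tails can be read from the same $\mathsf{Beta}(11,25)$ distribution:

pbeta(.5, 11, 25)   # posterior P(theta < 1/2), approximately 0.99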

BruceET