
I realize that the methodologies pursued by the Frequentist and Bayesian camps generally differ. However, one method of estimation that they do share is the optimization of a certain function:

  • Frequentists maximize the likelihood function, giving the Maximum Likelihood (ML) estimator.
  • Bayesians maximize the posterior function, giving the Maximum A Posteriori (MAP) estimator. (A minimal sketch of both follows.)
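For concreteness, here is a sketch of the two estimators, assuming a Gaussian mean with known variance and a conjugate normal prior (all numbers are illustrative, not from any of the discussion below):

```python
# Sketch: ML vs MAP for the mean of a Gaussian with known variance sigma2,
# under a conjugate normal prior N(mu0, tau2). All values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 1.0                     # known observation variance
mu0, tau2 = 0.0, 0.5             # assumed prior mean and variance
x = rng.normal(2.0, np.sqrt(sigma2), size=20)   # simulated data

ml = x.mean()                    # ML estimate: the sample mean
n = len(x)
# MAP estimate: the posterior is Gaussian, so its mode equals its mean,
# a precision-weighted average of the prior mean and the sample mean.
map_est = (mu0 / tau2 + n * ml / sigma2) / (1 / tau2 + n / sigma2)

print(f"ML  = {ml:.3f}")
print(f"MAP = {map_est:.3f}  (shrunk toward the prior mean {mu0})")
```

As `tau2` grows (the prior flattens out), the MAP estimate converges to the ML estimate, which is the situation discussed below.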

Both functions will typically have been constructed using Bayes' rule/theorem, which is universally agreed upon, and which might have been applied once (in "batch mode") or multiple times iteratively.

Similarly, both Frequentists and Bayesians will deduce their intervals (confidence/credible) from this function.

So if the prior is uninformative (assuming we can formulate such a prior), there should be no distinction between the "results" obtained by Bayesians and Frequentists, even though the interpretation of said results will be different.
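As an illustrative check of this (a sketch under the stated assumptions, not a general claim): for a Gaussian mean with known variance and a flat improper prior, the posterior is $N(\bar{x}, \sigma^2/n)$, so the 95% credible interval and the 95% confidence interval are numerically identical:

```python
# Sketch: confidence vs credible interval for a Gaussian mean with known
# sigma under a flat (improper) prior. All values are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sigma, n = 1.0, 25
x = rng.normal(3.0, sigma, size=n)
xbar, se = x.mean(), sigma / np.sqrt(n)

z = stats.norm.ppf(0.975)
ci = (xbar - z * se, xbar + z * se)            # frequentist confidence interval
# With a flat prior the posterior is N(xbar, se^2), so the central credible
# interval comes from exactly the same quantiles.
cri = stats.norm.interval(0.95, loc=xbar, scale=se)

print("confidence interval:", ci)
print("credible interval  :", cri)             # same numbers, different meaning
```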

If this is right, then the only practical difference between Bayesians and Frequentists is the prior. Is this true?


Edit:

Actually, the optimization bit of my question is a bit misleading, as it is only a specific example of the differences between Bayesian and Frequentist thinking. My question could be posed simply as: what is the difference between the likelihood function and the posterior? For example, would frequentists ever use MCMC to calculate the likelihood function?
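For illustration (a sketch only, with made-up numbers): MCMC targets a distribution rather than "calculating" a function, but one can run a sampler whose target is proportional to the likelihood, which is exactly the posterior under a flat prior. A minimal random-walk Metropolis version for a binomial proportion:

```python
# Sketch: random-walk Metropolis whose target is proportional to the
# binomial likelihood for x successes in n trials (equivalently, the
# posterior under a flat prior). All values are illustrative.
import numpy as np

def log_likelihood(theta, x=7, n=10):
    if not 0.0 < theta < 1.0:
        return -np.inf                       # outside the parameter space
    return x * np.log(theta) + (n - x) * np.log(1.0 - theta)

rng = np.random.default_rng(2)
theta, draws = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0.0, 0.1)  # random-walk proposal
    # Metropolis acceptance: under a flat prior the posterior ratio
    # reduces to the likelihood ratio.
    if np.log(rng.uniform()) < log_likelihood(proposal) - log_likelihood(theta):
        theta = proposal
    draws.append(theta)

draws = np.array(draws[5_000:])              # discard burn-in
print("mean of draws:", draws.mean())        # close to the MLE x/n = 0.7
print("central 95% of draws:", np.quantile(draws, [0.025, 0.975]))
```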

Edit, 10 years on: This was a confused question from a confused student. To be fair, the topic tends to confuse the uninitiated. Thanks to everyone who participated in the discussions.

Patrick
  • It depends on the definition of *noninformative prior* you use. If you use a flat prior on a domain that contains the MLE, then the MAP coincides with the MLE, but if you use a different sort of noninformative prior, then they may differ. –  Jun 01 '12 at 17:16
  • Regarding interval estimation, Bayesian and frequentist intervals have similar properties if you use [matching priors](http://www.ucl.ac.uk/statistics/research/pdfs/rr252.pdf), but not necessarily under other sorts of noninformative priors. For more information about the differences between the two approaches, take a look at these questions: [1](http://stats.stackexchange.com/questions/22/bayesian-and-frequentist-reasoning-in-plain-english), [2](http://stats.stackexchange.com/questions/27589/why-would-someone-use-a-bayesian-approach-with-a-noninformative-improper-prior) –  Jun 01 '12 at 17:30
  • Indeed, [data cloning](http://onlinelibrary.wiley.com/doi/10.1111/j.1461-0248.2007.01047.x/abstract) is an example of an MCMC method used for maximising the likelihood function. Also, both approaches benefit from some nonparametric methods; see [1](http://www.stat.lsa.umich.edu/~ionides/pubs/msle.pdf), [2](http://nature.berkeley.edu/~pdevalpine/papers/StateSpace/deValpine_JASA_04.pdf). –  Jun 01 '12 at 18:11
  • ML is not the only frequentist estimation procedure. Many estimation procedures can be justified because they are Bayes according to *some* (perhaps arbitrary) prior; a theorem says that (under certain regularity assumptions) such procedures are admissible. (One might say that every frequentist is a Bayesian when a defensible prior can be found, but also a frequentist would hesitate--on the same grounds--to use an "uninformative" prior.) But other frequentist procedures--notably certain minimax estimators--might not have any Bayes counterparts at all. – whuber Jun 01 '12 at 19:04
  • @Procrastinator About MCMC methods for likelihood functions: right, so they pretty much use the same techniques as in Bayesian statistics. Still, it seems state-space estimation is most commonly approached from the Bayesian perspective (even though the prior is usually "swamped out" by the data). Why is that? – Patrick Jun 01 '12 at 20:42
  • @Patrick I am not sure I can give you *the* reason for that, but one reason could be that in state-space models you are usually interested in prediction, which can be conducted using the predictive distribution in the Bayesian setting, while in the frequentist/classical/Fisherian approach, to the best of my knowledge, it is not clear how to do this in general in a way that accounts for the variability of the parameters. In addition, the coverage of profile likelihood intervals is not very good in high dimensions and small-to-medium samples. Someone may correct me if I am mistaken. –  Jun 01 '12 at 22:29
  • This point from the OP is not true: "both Frequentists and Bayesians will deduce their intervals (confidence/credible) from this function". The frequentist interval is not deduced from the likelihood (I mean the "observed" likelihood). It violates the likelihood principle (this is also the case for Bernardo's objective Bayesian theory). – Stéphane Laurent Jun 02 '12 at 09:14
  • @StéphaneLaurent, ok so where exactly do they deduce it from? – Patrick Jun 02 '12 at 14:31
  • @Patrick I don't understand who and what you are talking about. – Stéphane Laurent Jun 02 '12 at 16:14
  • @StéphaneLaurent, where exactly do you deduce the confidence interval from, if not from the likelihood? – Patrick Jun 03 '12 at 13:54
  • There's no general way to derive a confidence interval in frequentist statistics. But whichever way you use, the confidence interval obviously depends on the sampling distribution $p(\cdot \mid\theta)$, and not only on the likelihood $L(\theta \mid x^{\text{obs}})=p(x^{\text{obs}}\mid \theta)$. – Stéphane Laurent Jun 03 '12 at 15:05
  • @StéphaneLaurent, Thanks for your responses. Please bear with me. Isn't the likelihood (as you point out in your equality) just the sampling distribution regarded as a function of $\theta$? – Patrick Jun 05 '12 at 09:54
  • Not really. See this discussion http://stats.stackexchange.com/questions/29682/how-to-rigorously-define-the-likelihood The likelihood is the function $\theta \mapsto p(x^{\text{obs}}\mid \theta)$ for the observed outcome of the experiment $x^{\text{obs}}$, whereas the sampling distribution is the probability measure $p(\cdot \mid\theta)$. – Stéphane Laurent Jun 05 '12 at 11:05
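To make that distinction concrete (an illustrative addition): for $X \sim \operatorname{Binomial}(n, \theta)$, the sampling distribution is a probability measure over outcomes for fixed $\theta$,

$$p(x \mid \theta) = \binom{n}{x}\theta^{x}(1-\theta)^{n-x}, \qquad \sum_{x=0}^{n} p(x \mid \theta) = 1,$$

whereas the likelihood is the same expression read as a function of $\theta$ for the fixed observed value $x^{\text{obs}}$,

$$L(\theta \mid x^{\text{obs}}) = \binom{n}{x^{\text{obs}}}\,\theta^{x^{\text{obs}}}(1-\theta)^{n-x^{\text{obs}}},$$

which in general does not integrate to 1 over $\theta$.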

2 Answers


The maximum a posteriori (MAP) approach isn't really fully Bayesian; ideally, inference should involve marginalising over the whole posterior. Optimisation is the root of all evil in statistics; it is difficult to over-fit if you don't optimise! ;o) So the practical difference between the Bayesian and frequentist approaches runs rather deeper if you opt for a fully Bayesian solution, although there will often be a prior for which the result is numerically the same as the frequentist one.
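A minimal sketch of why marginalising matters (illustrative numbers only, e.g. a Beta(1,1) prior updated with 1 success in 6 trials, giving a skewed Beta(2,6) posterior):

```python
# Sketch: for a skewed posterior, the MAP point estimate and a fully
# Bayesian summary such as the posterior mean (which integrates over the
# whole posterior) can differ noticeably. Numbers are illustrative.
from scipy import stats

a, b = 2.0, 6.0                   # assumed Beta(2, 6) posterior
posterior = stats.beta(a, b)
map_est = (a - 1) / (a + b - 2)   # mode of Beta(a, b) for a, b > 1
mean_est = posterior.mean()       # posterior mean = a / (a + b)

print(f"MAP  = {map_est:.3f}")    # 0.167
print(f"mean = {mean_est:.3f}")   # 0.250: marginalising gives a different answer
```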

However, the credible interval and the confidence interval are answers to different questions, and shouldn't be considered interchangeable, even if they happen to be numerically the same. Treating a frequentist confidence interval as if it were a Bayesian credible interval can lead to problems of interpretation.
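For example (a sketch with illustrative numbers): a 95% Wald confidence interval and a 95% Jeffreys credible interval for a binomial proportion are numerically close here but not identical, and they answer different questions in any case:

```python
# Sketch: Wald confidence interval vs Jeffreys Beta(1/2, 1/2) credible
# interval for x = 7 successes in n = 10 trials. Numbers are illustrative.
import numpy as np
from scipy import stats

x, n = 7, 10
p_hat = x / n
se = np.sqrt(p_hat * (1 - p_hat) / n)
wald = (p_hat - 1.96 * se, p_hat + 1.96 * se)               # frequentist
jeffreys = stats.beta(x + 0.5, n - x + 0.5).interval(0.95)  # Bayesian

print("Wald confidence interval  :", wald)
print("Jeffreys credible interval:", jeffreys)
```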

UPDATE "My question could be posed simply as the difference simply between the likelihood function and the posterior."

No. The definitions of probability are different, which means that even if the solutions are numerically identical, they do not mean the same thing. This is a practical issue as well as a conceptual one, as the correct interpretation of the result depends on the definition of probability.

Tongue in cheek: if the answer to the question were "yes", it would imply that frequentists were merely Bayesians who always used flat (often improper) priors for everything. I doubt many frequentists would agree with that! ;o)

Dikran Marsupial
  • Yes, if you've done all the hard work involved in calculating the posterior, you don't want to throw away information by reducing it to the MAP. I guess my question can be reformulated simply in terms of the posterior function vs. the likelihood function, instead of MAP vs. ML. I am aware that credible and confidence intervals are answers to different questions. However, they can both be used as interval estimates for the parameter, which makes them comparable. – Patrick Jun 01 '12 at 17:47
  • However different things are meant by "interval estimate" in each framework. All too often a frequentist interval is interpreted as being an interval that contains the true value with high probability, which is not supported by the frequentist analysis, and as Jaynes demonstrated isn't necessarily even true. Rather than adopting one framework or the other, the best thing to do is to be comfortable with both, and use the framework that most directly answers the question you really want answered. – Dikran Marsupial Jun 01 '12 at 18:02
  • @DikranMarsupial +1 I really like your answer and also your follow-up comment. I think the war between the Bayesians and frequentists is over and nobody won. A lot of statisticians, including Brad Efron, are adopting the position you describe in the last sentence above. – Michael R. Chernick Jun 01 '12 at 19:18

I would agree that, roughly speaking, you are right. Priors, noninformative or not, will lead to different solutions. The solutions will converge when the data dominate the prior. Also, Jeffreys needed improper priors in some cases to match Bayesian results with frequentist results. The real difference, and the controversy, is philosophical. Frequentists want objectivity; the prior brings in subjective opinion. Bayesians following the teachings of de Finetti believe that probability is subjective. For a true Bayesian, priors should be informative.

The other point, also related to the differing concepts of probability, is that probability can be assigned to an unknown parameter according to Bayesians, while frequentists think strictly in terms of probability spaces as given in the theory developed by Kolmogorov and von Mises. For frequentists, only the random variables that you can define on the probability space have probabilities associated with their outcomes. So the probability of getting a head on a coin toss is 1/2 because repeated flipping leads to a relative frequency of heads that converges to 1/2 as the sample size approaches infinity.
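A minimal simulation of that last statement (illustrative only): the running relative frequency of heads in repeated fair-coin flips settles toward 1/2.

```python
# Sketch: the relative frequency of heads converges to 1/2 as the number
# of fair-coin flips grows; this is the frequentist notion of probability.
import numpy as np

rng = np.random.default_rng(3)
flips = rng.integers(0, 2, size=100_000)   # 0 = tails, 1 = heads
running = np.cumsum(flips) / np.arange(1, len(flips) + 1)
for k in (10, 100, 1_000, 100_000):
    print(f"after {k:>7} flips: relative frequency = {running[k - 1]:.4f}")
```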

For frequentists, Bayes' theorem applies to events, which are measurable sets in a probability space. Bayesians apply it to parameters as if the parameter were a random variable; that is the frequentists' objection to Bayesian methods. Bayesians, in turn, object to the frequentist approach because it lacks a property called coherence. I will not go into that here, but you can look up the definition on the internet or read Dennis Lindley's books.

Michael R. Chernick
  • +1, however I would disagree that Bayesianism is necessarily subjective (I agree with Jaynes on that one) or that frequentism is completely objective (there are often assumptions that are equivalent to having a prior belief about the model, which are not a formal part of the framework but exist nevertheless). Also, I don't see why noninformative priors are non-Bayesian: they encode the prior knowledge that you know you don't know something. As usual, this doesn't imply that Bayesianism is better than frequentism or vice versa, just as a flat screwdriver is not better than a Pozidriv one. – Dikran Marsupial Jun 01 '12 at 17:31
  • @DikranMarsupial I think there are various camps in both the Bayesian and frequentist schools. But I think what I would call the pure Bayesians are those who follow de Finetti's subjective view of probability: the purpose of the prior is to include subjective information prior to collecting data. Pure frequentists follow von Mises' approach to probability. Where should we put the empirical Bayesians? Do they belong to the Bayesian school, or, in their attempt to inform the prior with the data, are they more in the frequentist school? – Michael R. Chernick Jun 01 '12 at 17:46
  • Fisher was the first "fiducialist," a school whose following is almost nonexistent now. His school wanted to do inference based on a form of inverse probability that involved the likelihood function but not the prior, yet differed from the approach to inference due to Neyman. My point was not really to strictly define the Bayesian and frequentist camps, but rather to show the OP (Patrick) that it isn't quite as simple as he makes it out to be. – Michael R. Chernick Jun 01 '12 at 17:51
  • I agree about the various camps. If you feel that only followers of de Finetti are true Bayesians, then perhaps it would be better to state that as your opinion rather than as a fact; followers of Jaynes, for instance, would not agree, and are no less pure Bayesians than those of de Finetti. Neither has a claim to be the one true school, nor the final word. In my experience uninformative priors are extremely useful as a guard against jumping to conclusions by ignoring your ignorance of some aspect of the system. – Dikran Marsupial Jun 01 '12 at 17:54
  • I would count anyone who views probability as a measure of the state of knowledge as a Bayesian, and those who strictly limit it to long-run frequencies as frequentists. That seems the most crucial distinction between the two approaches. I agree with most of what you wrote (hence the +1); I was just adding some caveats where it was over-specific. I had an interesting discussion about fiducialism with a frequentist colleague a while back; it seems to me that it was a tacit admission that the quest for objectivity placed uncomfortable limits on what could be addressed. – Dikran Marsupial Jun 01 '12 at 17:58
  • I pretty much understand that frequentists don't view the parameter as a random variable, and hence cannot view the likelihood function, seen as a function of the parameter, as a probability (hence its name). Still, it would seem that doesn't really affect the information content of the likelihood function compared to the posterior. – Patrick Jun 01 '12 at 18:13
  • I didn't mean to get you so bent out of shape about this. Be clear that I am not defining frequentist or Bayesian camps. I am sorry that my language may have been a little loose. As I said I was just trying to show the OP that it isn't very simple and the best way to show the differences is to look at the extremes of the camps. When I used the term pure Bayesian or pure Frequentist I did not mean to put any kind of connotation on it in terms of good or bad etc. – Michael R. Chernick Jun 01 '12 at 18:15
  • I was just trying to distinguish the subjectivists who follow de Finetti (like Lindley, whom I classified as pure, but I will accept whatever label you think is not offensive) from others like Jaynes and Jeffreys. On the other hand, I wanted to distinguish those who follow von Mises (whom I called pure, but again will accept any non-offensive characterization) from others in the frequentist camp, as well as from the empirical Bayesians and the fiducialists. – Michael R. Chernick Jun 01 '12 at 18:17
  • Michael, don't worry, I'm not bent out of shape, just adding caveats for those who might not be aware of the range of views. Electronic means of communication are not very good at conveying the tone reliably. I'm an engineer by training, so I like any tool that has its use and while I like the objectivist Bayesian framework, I am not a zealot about it! ;o) F*#@y logic, that is another matter entirely! – Dikran Marsupial Jun 01 '12 at 18:22
  • @Patrick I think the frequentist can look at the likelihood as a probability: it is the probability of obtaining the observed sample data given the fixed value of the parameter. As a function of the parameter theta, it shows how that probability changes with different fixed values of theta. The maximum likelihood principle says to pick for theta the estimate that maximizes this probability. – Michael R. Chernick Jun 01 '12 at 18:22
  • @DikranMarsupial My previous comments came before I read your last one. I think we are pretty much on the same page and I am sorry for the poor semantics. – Michael R. Chernick Jun 01 '12 at 18:24
  • No problem, I have found your posts very informative, keep up the good work! – Dikran Marsupial Jun 01 '12 at 18:38
  • I disagree with this point in the comment by Dikran: "I don't see why noninformative priors are non-Bayesian, they encode the prior knowledge". Any prior distribution is informative. In fact, a noninformative prior is not a distribution; it is just a function yielding a "noninformative posterior" via a formal application of Bayes' formula. A "noninformative posterior" is a distribution that aims to reflect what information about the parameters is brought by the data, and only the data. My school is Bernardo's theory of reference priors. – Stéphane Laurent Jun 02 '12 at 09:10
  • @StéphaneLaurent I agree with your point, with one minor correction: some priors that are called noninformative are improper and therefore are not probability distributions. But this depends on the parameter and its range of possible values. A noninformative prior that is a distribution is U[0,1] for a proportion p. – Michael R. Chernick Jun 02 '12 at 12:54
  • I also consider that the Jeffreys–Bernardo prior Beta(1/2,1/2) for a binomial proportion is not a distribution, even though it is integrable. Like any other distribution, Beta(1/2,1/2) is informative. Concerning the uniform distribution on (0,1), it is not noninformative in the sense of any theoretical definition. – Stéphane Laurent Jun 02 '12 at 13:23
  • (there were no characters left) People say this prior is noninformative because it assigns the same probability to each possible value of the parameter, and I think it is a mistake to consider that uniformity means noninformativeness. In passing, this would imply that the prior on the odds parameter p/(1-p) is informative, since it is not uniform? That is not coherent; every distribution is uniform up to a transformation. – Stéphane Laurent Jun 02 '12 at 13:24
  • I understand the Bayesian argument that you have given, and that is why I said I agree. My only point of contention is that if f(x) is nonnegative and integrates to 1, it is a probability density and hence describes a probability distribution. So I don't understand why you say that proper priors such as Beta(1/2, 1/2) are not distributions. I agree that the priors that are called noninformative are informative; I just used the term because that is what they are commonly called. – Michael R. Chernick Jun 02 '12 at 13:39
  • This is a matter of definition: the only distribution which is defined in Bernardo's theory is the reference posterior distribution. The function $\pi(\theta)=\theta^{-\frac12}(1-\theta)^{-\frac12}$ is such that you get the reference posterior distribution when applying Bayes' formula with $\pi(\theta)$ as if it was a prior distribution. This is the definition of a reference prior function, and there's no definition of a reference prior distribution. The reason is that the reference prior function does not describe prior knowledge, even when it is an integrable function. – Stéphane Laurent Jun 02 '12 at 14:16
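(For reference, an illustrative derivation of the function under discussion: for a binomial model the Fisher information is $I(\theta) = n/\{\theta(1-\theta)\}$, and Jeffreys' rule $\pi(\theta) \propto \sqrt{I(\theta)}$ gives

$$\pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2},$$

the kernel of a Beta(1/2, 1/2). Whether this integrable function should be called a prior *distribution*, rather than a formal device, is exactly the point at issue in the exchange above.)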