5

Since Kahneman and Tversky found that humans do not estimate probabilities accurately, how can Bayes' theorem, starting from subjectively chosen probabilities, still produce accurate predictions (as in insurance pricing) once extra data are taken into account?

In other words, humans are often wrong about probabilities, yet Bayesian methods still seem to work when they start from our (prior) estimates. Have I misunderstood something?

To clarify the insurance reference: before large datasets were available, American insurers relied on a Bayesian method to calculate premiums, starting with priors that were, for all intents and purposes, a best guess. The policies were still fair and profitable.

(Edit: I haven't accepted an answer for this question, because there are several answers below which together answer the question.)

Grubbmeister
  • 185
  • 6

4 Answers

7

You are talking about Bayesian analysis rather than Bayes' theorem, but we know what you mean.

Let me hit you with an idea that is even stranger than the one you are thinking about. As long as you use your real prior density in constructing your model, then all Bayesian estimators are admissible, where admissibility is defined as the least risky way to make an estimate. This means that even in the K-T example you will get an admissible statistic. This does exclude the case of degenerate priors.

K-T do not directly discuss the formation of priors, but the question you are really asking is about predictive accuracy under flawed prior distributions.

Now let's look at Bayesian predictions under two different knowledge sets.

For purposes of exposition, assume that the American Congress, exercising its well-noted wisdom, and under heavy lobbying by the Society of American Magicians, has decided to produce magical quarters. It authorizes the production of fair coins, biased coins, and double-headed and double-tailed coins. The double-headed and double-tailed coins are easy to evaluate by inspection, but the ordinary two-sided coins cannot be evaluated without flipping them.

A decision is made to flip a coin eight times. From those flips, a gamble will be made on how many heads will appear in the next eight flips. Each coin either has a 2/3 bias toward heads, a 2/3 bias toward tails, or is perfectly fair. The coin that will be tossed is randomly selected from a large urn containing a representative sample of coins from the U.S. Mint.

There are two gamblers. One has no prior knowledge; the other has phoned the U.S. Mint to learn the distribution of the coins that are produced. The first gambler assigns probability 1/3 to each case, while the knowledgeable gambler assigns probability 1/2 to a fair coin and splits the remaining probability evenly between the two biased cases.

The referee tosses the coin, and six heads are shown. The observed proportion, 6/8 = 0.75, is not equal to any possible parameter value. The maximum likelihood estimator is 0.75, as is the minimum variance unbiased estimator. Although 0.75 is not a possible value of the parameter, this does not violate the theory.

Now both Bayesian gamblers need to make predictions. For the ignorant gambler, the posterior predictive mass function for the number of heads $k$ in the next eight tosses is $$\Pr(K=k)=\binom{8}{k}\left[0.0427\left(\tfrac{1}{3}\right)^k\left(\tfrac{2}{3}\right)^{8-k}+0.2737\left(\tfrac{1}{2}\right)^8+0.6838\left(\tfrac{2}{3}\right)^k\left(\tfrac{1}{3}\right)^{8-k}\right].$$ For the knowledgeable gambler, it is $$\Pr(K=k)=\binom{8}{k}\left[0.0335\left(\tfrac{1}{3}\right)^k\left(\tfrac{2}{3}\right)^{8-k}+0.4298\left(\tfrac{1}{2}\right)^8+0.5367\left(\tfrac{2}{3}\right)^k\left(\tfrac{1}{3}\right)^{8-k}\right].$$ The bracketed weights are the posterior probabilities of the three coin types after observing six heads in eight tosses.

[Figure: predictions from one round of coin tosses]
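
As a sanity check on the arithmetic, here is a minimal sketch (mine, not from the original answer) that recomputes the posterior weights and both predictive mass functions with NumPy/SciPy:

```python
import numpy as np
from scipy.stats import binom

p_values = np.array([1/3, 1/2, 2/3])           # possible head probabilities
prior_ignorant = np.array([1/3, 1/3, 1/3])     # no information from the Mint
prior_informed = np.array([0.25, 0.50, 0.25])  # 50% fair, rest split evenly

def posterior(prior, heads, tosses):
    """Posterior over the three coin types given the observed number of heads."""
    likelihood = binom.pmf(heads, tosses, p_values)
    unnormalized = prior * likelihood
    return unnormalized / unnormalized.sum()

def predictive(post, tosses=8):
    """Posterior predictive for the number of heads in the next `tosses` flips."""
    k = np.arange(tosses + 1)
    # mix the three binomial pmfs with the posterior weights
    return np.array([binom.pmf(k, tosses, p) for p in p_values]).T @ post

post_ignorant = posterior(prior_ignorant, heads=6, tosses=8)  # ~[0.0427, 0.2737, 0.6838]
post_informed = posterior(prior_informed, heads=6, tosses=8)  # ~[0.0335, 0.4298, 0.5367]

print(predictive(post_ignorant))
print(predictive(post_informed))
```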

Even in this trivial case the two predictions do not match, yet both are admissible. Why?

Let's think about the two actors. Both have included all the information they have; there is nothing else. Further, although the knowledgeable actor knows the national distribution, they do not know the distribution of the coins delivered to their local bank. It could be that those are all biased toward tails. Still, both have impounded all the information that they believe to be true.

Now imagine that this game is played one more time. The two gamblers happen to be sitting side by side, and the ignorant gambler gets to see the odds posted by the knowledgeable gambler, and vice versa. The ignorant gambler can recover the knowledgeable gambler's prior information at no cost by inverting their probabilities, as sketched below. Now both can use the extra knowledge.
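
To make "inverting their probabilities" concrete, here is a hedged sketch of one way to do it (my construction, not the answer's): because both gamblers saw the same six heads, dividing the knowledgeable gambler's posted posterior weights by the shared likelihood and renormalizing recovers that gambler's prior.

```python
import numpy as np
from scipy.stats import binom

p_values = np.array([1/3, 1/2, 2/3])
posted_posterior = np.array([0.0335, 0.4298, 0.5367])  # read off the other gambler's odds

# Both gamblers observed the same data: 6 heads in 8 tosses.
likelihood = binom.pmf(6, 8, p_values)

# posterior is proportional to prior * likelihood, so prior is proportional to posterior / likelihood
recovered_prior = posted_posterior / likelihood
recovered_prior /= recovered_prior.sum()

print(recovered_prior)  # approximately [0.25, 0.50, 0.25], the Mint's distribution
```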

The referee tosses the coin again and obtains four heads and four tails. This knowledge is combined to create a new prediction that is now common to both gamblers. Its image is in the chart below.

[Figure: the shared second-round prediction]
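
Continuing the sketch above (again my own construction, under the same assumptions), the shared second-round prediction comes from updating the common prior with both rounds of data, six heads in the first eight tosses and four heads in the next eight:

```python
import numpy as np
from scipy.stats import binom

p_values = np.array([1/3, 1/2, 2/3])
shared_prior = np.array([0.25, 0.50, 0.25])   # the Mint's distribution, now known to both

# Both rounds of data: 6 heads in the first 8 tosses, 4 heads in the next 8.
likelihood = binom.pmf(6, 8, p_values) * binom.pmf(4, 8, p_values)

post_shared = shared_prior * likelihood
post_shared /= post_shared.sum()

# Shared posterior predictive for the number of heads in a further 8 tosses.
k = np.arange(9)
joint_prediction = np.array([binom.pmf(k, 8, p) for p in p_values]).T @ post_shared
print(post_shared)
print(joint_prediction)
```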

A gambler who had only seen the four heads and four tails, and not the earlier tosses, might make yet a third prediction. Interestingly, for Frequentist purposes you cannot carry information over to a second sample, so the prediction is independent of prior knowledge. This is bad. What if it had been eight heads instead, or eight tails? The maximum likelihood estimator and the minimum variance unbiased estimator would then correspond to a double-headed or double-tailed coin, with zero variance as well.

For this second-round prediction, no admissible Frequentist estimator exists. In the presence of prior knowledge, Frequentist statistics cease to be admissible. An intelligent statistician would simply combine the samples, but that violates the rules unless you are doing a meta-analysis.

A meta-analysis solution is still problematic, though. A Frequentist prediction could be constructed from the intervals and the errors, but it would still be centered on the pooled proportion 10/16, that is, (6+4)/(8+8) = 0.625, which is not a possible parameter value. Although it is "unbiased," it is also impossible. Using the errors would improve matters, but the result is still not equal to the Bayesian one.

Furthermore, this is not limited to this contrived problem. Imagine a case where the data are approximately normal but the parameter has no support on the negative real numbers. I have seen plenty of time-series analyses with coefficients that are impossible. They are valid minimum variance unbiased estimates, but they are also impossible solutions because they are excluded by theory and by rationality. A Bayesian estimator would have put zero mass on the disallowed region; the Frequentist estimator cannot.
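
As an illustration of that last point (a toy of my own, not from the answer): suppose a coefficient is known on theoretical grounds to be non-negative, but the noisy sample mean comes out negative. A grid posterior under a prior restricted to $[0,\infty)$ puts zero mass on the disallowed region, while the MLE (the sample mean) does not.

```python
import numpy as np

rng = np.random.default_rng(0)

true_theta = 0.1                          # true coefficient, known to be >= 0
y = rng.normal(true_theta, 1.0, size=5)   # small, noisy sample

mle = y.mean()                            # can easily be negative

# Grid posterior with a flat prior restricted to theta >= 0 (a hypothetical choice).
theta = np.linspace(0.0, 3.0, 3001)
log_lik = -0.5 * ((y[:, None] - theta[None, :]) ** 2).sum(axis=0)
post = np.exp(log_lik - log_lik.max())
post /= post.sum()

print("MLE:", mle)                        # may be an "impossible" negative value
print("Posterior mean:", theta @ post)    # always respects the constraint
```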

You are correct in understanding that Bayesian predictions will be biased; in fact, all estimators made with proper priors are guaranteed to be biased, and the bias will differ from person to person. Yet there is no less risky solution, and when Frequentist solutions exist they are at best only equally risky.

The Frequentist predictions do not depend upon the true value of $p$ (this is also true of the Bayesian predictions), but they do depend upon the count of observed outcomes. If the Frequentist case is included, the predictions become the following graph.

[Figure: predictions including the Frequentist case]

Because it cannot correct for the fact that some parameter values cannot occur, nor account for prior knowledge, the Frequentist prediction is actually more extreme, because it averages over an infinite number of repetitions that have yet to happen. The predictive distribution for the binomial turns out to be the hypergeometric distribution.

Bias is the guaranteed price you must pay for the generally increased Bayesian accuracy. You lose the guarantee against false positives, and you lose unbiasedness. You gain valid gambling odds, which non-Bayesian methods cannot produce.

Dave Harris
  • 6,957
  • 13
  • 21
  • 3
    "for Frequentist purposes, you cannot carry information over to a second sample" ... you can't? If a frequentist tosses a coin 10 times today and 30 times tomorrow, can a frequentist not simply say "I have two sets of observations on this coin, let me estimate P(head) from all the available data". Indeed, for a more complicated model (perhaps one with variation in parameters from one set of observations to the next in some fashion), one can often still estimate parameters – Glen_b Nov 14 '17 at 07:37
  • Thanks Dave! So, essentially, Bayesian approaches work because they can account for data that a frequentist approach can't. Part of the answer, then, is that humans can do that easily, so even with a degenerate prior, they may be capturing data that wouldn't otherwise be included. – Grubbmeister Nov 14 '17 at 07:40
  • @Grubbmeister, no, except in the case of a degenerate prior. If one believed X with 100 percent certainty, then it would be impossible to learn that Y is true. – Dave Harris Nov 14 '17 at 13:14
  • 2
    I think some additional care in characterizing things and a somewhat tighter exposition would help this post. To mention just a couple of brief things: (a) the concept of *admissibility* is less defined as "the least risky way to make an estimate" and more one of "a not obviously inferior estimate", (b) there are plenty of places, even in financial economics, where "putting zero mass on a disallowed region" has gotten people in a serious mess, e.g., *very* popular stochastic interest rate models where the rate cannot go negative, pricing models where prices don't decline, etc., ... – cardinal Nov 14 '17 at 17:24
  • 1
    ... and (c) we must also consider the flip side of part of the argument here where the priors can end up strongly (and at times, possibly inadvertently) coloring our conclusions. I remember a passionate discussion in AmStat News from a few years ago where courtroom evidence was based on a Bayesian analysis with a particular prior. Turns out that prior implicitly excluded the possibility of rejecting a certain hypothesis of jurisprudential interest _regardless of what the observed data could have been_. – cardinal Nov 14 '17 at 17:31
5

While a person might be wrong in a particular moment about the likelihood of a particular thing happening, the idea behind Bayes' theorem (as applied to updating your understanding in the face of new information) is that the updated probability may not be entirely right, but it will be more right than where you started.

Think of a situation where I'm trying to estimate the number of sheep in a field. Let's imagine that there are truly 100, but I estimate that there are no sheep at all (which is about as wrong as I can get). Then I see a sheep and update my estimate: now I estimate that there is one sheep in the field! I'm still wrong, but I'm slightly less wrong than I was when I started. In this way, if you collect enough information, you can update your estimates to be closer to reality, and indeed, by collecting enough data, you can get arbitrarily close to it.
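
Here is a hedged sketch of that convergence (my own toy model, not the answer's): suppose each quick scan of the field spots each of the sheep independently with some assumed probability, say 0.3, so a scan's count is binomial in the true number of sheep. A grid posterior over the count then concentrates around the true value of 100 as scans accumulate.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)

true_n = 100          # true number of sheep
detect_p = 0.3        # assumed chance of spotting any given sheep per scan (made up)

n_grid = np.arange(0, 501)                     # candidate sheep counts
log_post = np.zeros_like(n_grid, dtype=float)  # flat prior over the grid

for scan in range(1, 31):
    seen = rng.binomial(true_n, detect_p)          # simulated scan of the field
    log_post += binom.logpmf(seen, n_grid, detect_p)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    if scan in (1, 5, 30):
        print(scan, "scans -> posterior mean:", round(float(n_grid @ post), 1))
```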

A really good description of this (albeit a pretty technical one) is in Savage's The Foundations of Statistics. It's a great read, and he develops a way of thinking about probability that makes a lot more sense from a Bayesian perspective.

4

When you compare specifying a prior with using a raw frequentist method, the prediction is sometimes much better even with an apparently wrong prior, because the prior does not need to be at all precise to improve things. A very rough prior helps rule out what is unrealistic, and that helps even when it is specified very imprecisely. It is a probabilistic way of restricting the parameter to a subset of its possible values. As a joke: when estimating the number of sheep in a field, ruling out values greater than $10^{52}$ is a safe guess.

A counter-productive prior is of course theoretically possible: firmly believing something that is totally unreal. But that is most often the consequence of a mathematical misunderstanding in a difficult formalization. This is an example of such a mistake: http://www.nowozin.net/sebastian/blog/estimating-discrete-entropy-part-3.html

If the overall formalization is good, wrong numerical information has fewer consequences. I was convinced by one of the simplest Bayesian methods: $L^2$ regularization in linear regression.

The model is $Y=\beta X+\epsilon$. If you have few data points compared to the feature dimension, the basic frequentist estimator (the MLE) $\hat\beta$ will most often be severely over-fitted and yield very poor predictions, because you allow it to consider every possible $\beta$. It is not rare for it to have higher error than a constant predictor such as 0 (in a real situation).

Now some vague intuition, experience, or rumour tells you that $\beta$ is in fact unlikely to have a large norm, and that estimates of $\beta$ with a large norm are just an effect of over-fitting.

You think the real $\beta$ tends to be reasonably small. You say: my $\beta$ is around 0 with variance... hmm... I don't know... say 1. Formally, this is a Gaussian prior on $\beta$, and 1 is the regularization constant.

But if you choose 2 instead of 1, you will get roughly the same results, and if you choose 1.2 you cannot even see the difference. (This is not a general fact, just the kind of thing we often observe.) There is actually a very wide range of values that yields much better results than the unregularized estimator, and the error curve tends to be fairly flat around the optimal choice.

I did a few simulations with misspecified priors in this setting; see the sketch below. You can assume a badly wrong prior, and the results are still far better than without regularization, because a flat prior is worse than the worst misspecification you can reasonably come up with.
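
Here is a minimal sketch of that kind of simulation (my own, with made-up dimensions and deliberately rough prior scales): ridge regression, i.e. the MAP estimate under a Gaussian prior, compared with ordinary least squares on held-out data.

```python
import numpy as np

rng = np.random.default_rng(42)

n, d = 30, 25                        # few observations relative to features
beta_true = rng.normal(0, 0.5, d)    # true coefficients are modest in norm

X = rng.normal(size=(n, d))
y = X @ beta_true + rng.normal(0, 1.0, n)

X_test = rng.normal(size=(1000, d))
y_test = X_test @ beta_true

def ridge(X, y, lam):
    """MAP estimate under a Gaussian prior with precision lam (lam = 0 gives OLS/MLE)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in [0.0, 1.0, 2.0, 10.0]:    # 0 = flat prior; others = (mis)specified priors
    beta_hat = ridge(X, y, lam)
    mse = np.mean((X_test @ beta_hat - y_test) ** 2)
    print(f"lambda = {lam:5.1f}  test MSE = {mse:.2f}")
```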

Being a hyper-parameter, the regularization coefficient can be chosen loosely without much consequence for the prediction. This tends to be true in many machine learning situations: the further you go toward hyper-parameters, hyper-hyper-parameters, and so on, the less sensitive the result becomes to wrong specification (usually, and provided the methods are good).

Benoit Sanchez
  • 7,377
  • 21
  • 43
2

First, Bayes' theorem doesn't make predictions; it's a mathematical law. But you do have to get the probabilities right for it to work.
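
For reference, the theorem itself is just the identity $$\Pr(A\mid B)=\frac{\Pr(B\mid A)\,\Pr(A)}{\Pr(B)},$$ which is exact whenever the probabilities on the right-hand side are the right ones.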

Second, you may be thinking of a Bayesian approach to data analysis. This does depend on priors, but (a) sometimes (often) a uniform prior is chosen, and (b) other times the prior is based on actual data.

Third, Kahneman and Tversky really have nothing to do with this. They studied how people reason with probabilities, even when the probabilities are given to them. For instance, a 10% risk of death is not viewed the same way as a 90% chance of survival. K & T did a lot of damage to the notion of the "rational human," but that is more about economics than statistics.

Peter Flom
  • 94,055
  • 35
  • 143
  • 276