
I am writing a short theoretical essay for a Bayesian statistics course (in an Economics M.Sc.) on uninformative priors, and I am trying to understand the steps in the development of this theory.

So far, my timeline consists of three main steps: Laplace's principle of indifference (1812), invariant priors (Jeffreys, 1946), and Bernardo's reference priors (1979).

From my literature review, I've understood that the principle of indifference (Laplace) was the first tool used to represent a lack of prior information, but its failure to be invariant under reparameterization led to its abandonment until the 1940s, when Jeffreys introduced his method, which has the desired invariance property. The marginalization paradoxes arising from the careless use of improper priors in the 1970s then pushed Bernardo to develop his reference prior theory to deal with this issue.

Reading the literature, each author cites different contributions: Jaynes' maximum entropy, Box and Tiao's data-translated likelihood, Zellner, ...

In your opinion, what are the crucial steps I am missing?

EDIT: I am adding my (main) references, in case anyone needs them:

1) Kass and Wasserman, The Selection of Prior Distributions by Formal Rules

2) Yang and Berger, A Catalog of Noninformative Priors

3) Noninformative Bayesian Priors: Interpretation and Problems with Construction and Applications

EDIT 2: Sorry for the two-year delay, but you can find my essay here.

PhDing
  • once you have finished that theoretical essay, would you be so kind as to link it here? – Nikolas Rieble Nov 18 '16 at 16:07
  • Sure, my deadline is in December, so you'll have to wait until then ;) – PhDing Nov 18 '16 at 16:14
  • It would be great if you could provide an answer to your own question summarizing your thesis. – Tim Nov 18 '16 at 18:47
  • @Tim I've tried to sum up a bit what I've understood from my literature review – PhDing Nov 18 '16 at 19:53
  • I've linked this article before, but [the epic history of maximum likelihood](https://arxiv.org/pdf/0804.2996.pdf) covers the historical "gap" between Laplace and Jeffrey's: the work of Gauss, Hotelling, Fisher, Bernoulli, and others pointed estimation toward maximum likelihood during that time. – AdamO Nov 18 '16 at 20:12
  • If I understand the premise correctly, Bernoulli worked under the theoretical framework of priors, and I think the early attempts at uncovering "uninformative priors" made everybody speculate why priors were even necessary, and much work occurred in the area of maximum likelihood. It's interesting that early inference was markedly more Bayesian (albeit with less rigor), and the advent of ML was later and more theoretically cogent. – AdamO Nov 18 '16 at 20:17
  • @AdamO Thanks for the suggestion. I'll ask whether it is on topic, because the essay is specifically on uninformative priors in Bayesian statistics. Do you have suggestions in those terms? – PhDing Nov 19 '16 at 14:48
  • @alessandro you should definitely read the paper as it touches on these points specifically. – AdamO Nov 19 '16 at 20:43
  • @alessandro it describes how the Laplacian approach was maintained for basically a century after Gauss developed and used uniform priors (conceiving them as noninformative). Pearson and Kirstine Smith disavowed ML because the resulting inference did not deal with probabilities as a Bayesian would desire. – AdamO Nov 19 '16 at 20:57
  • Minute (pedantic, if you like) but possibly useful point: Jeffreys = (Professor Sir) Harold Jeffreys, British applied mathematician, geophysicist and much else; he explained to me in a letter 40 years ago that he preferred the possessive Jeffreys's because Jeffreys' was liable to mutation to the quite incorrect Jeffrey's. Above we have an example! (It doesn't help that Richard C. Jeffrey, American philosopher, an entirely different person, also wrote on probability.) – Nick Cox Nov 29 '16 at 19:11
  • @NickCox: It's alarming to hear of the near namesake. For some time I conflated two Bergers & I hope I haven't been doing the same with Jeffreys & Jeffrey. – Scortchi - Reinstate Monica Dec 02 '16 at 10:54
  • @Scortchi From my collection: the following statisticians' surnames end with "s", so whatever is attributed to them is tagged with (e.g.) Jeffreys or Jeffreys', but not Jeffrey's: Harold Jeffreys (1891-1989); Colin L. Mallows (1930-); John P. Mills (fl. 1926; Mills ratio); Samuel Stanley Wilks (1906-1964; but note Martin Bradbury Wilk, 1922-2013); Frank Yates (1902-1994). – Nick Cox Dec 02 '16 at 11:19
  • (ctd) Karl Pearson (1857-1936) was father of Egon S. Pearson (1896-1980). Gertrude M. Cox (1900-1978) was unrelated to David R. Cox (1924-). Maurice G. Kendall (1907-1983) was unrelated to David G. Kendall (1918-2007). – Nick Cox Dec 02 '16 at 11:21
  • @NickCox: Oh! So (Maurice) Kendall's tau & Library, but (David) Kendall's notation for queues. – Scortchi - Reinstate Monica Dec 02 '16 at 11:27
  • Quite so. Wilfrid S. Kendall (probably better known as probabilist than as statistician) is one of David's sons. See also https://en.wikipedia.org/wiki/Bridget_Kendall – Nick Cox Dec 02 '16 at 11:35
  • I do wonder whether there is a place for mentioning conjugate priors (and/or the idea of using priors that correspond to a small amount of hypothetical prior data, e.g. unit-information priors), which can coincide with flat priors, as well as Jeffreys' prior - or is that more in the direction of computational methods? One might say that computational methods matter, because e.g. good MCMC methods took away certain limitations on which priors were easy to use in practice, or one could argue that this does not affect the basic principles. – Björn Dec 05 '16 at 13:16

3 Answers


What you seem to be missing is the early history. You can check the paper by Fienberg (2006), When Did Bayesian Inference Become "Bayesian"?. First, he notes that Thomas Bayes was the first to suggest using a uniform prior:

In current statistical language, Bayes' paper introduces a uniform prior distribution on the binomial parameter, $\theta$, reasoning by analogy with a "billiard table" and drawing on the form of the marginal distribution of the binomial random variable, and not on the principle of "insufficient reason," as many others have claimed.

Pierre Simon Laplace was the next person to discuss it:

Laplace also articulated, more clearly than Bayes, his argument for the choice of a uniform prior distribution, arguing that the posterior distribution of the parameter $\theta$ should be proportional to what we now call the likelihood of the data, i.e.,

$$ f(\theta\mid x_1,x_2,\dots,x_n) \propto f(x_1,x_2,\dots,x_n\mid\theta) $$

We now understand that this implies that the prior distribution for $\theta$ is uniform, although in general, of course, the prior may not exist.
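
To make the implication explicit (a one-line gloss, not part of Fienberg's text): by Bayes' theorem,

$$ f(\theta\mid x_1,x_2,\dots,x_n) \propto f(x_1,x_2,\dots,x_n\mid\theta)\,f(\theta), $$

so the posterior is proportional to the likelihood alone exactly when $f(\theta)\propto 1$, i.e. when the prior on $\theta$ is uniform (possibly improper if the parameter space is unbounded).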

Moreover, Carl Friedrich Gauss also referred to the use of an uninformative prior, as noted by David and Edwards (2001) in their book Annotated Readings in the History of Statistics:

Gauss uses an ad hoc Bayesian-type argument to show that the posterior density of $h$ is proportional to the likelihood (in modern terminology):

$$ f(h\mid x) \propto f(x\mid h) $$

where he has assumed $h$ to be uniformly distributed over $[0, \infty)$. Gauss mentions neither Bayes nor Laplace, although the latter had popularized this approach since Laplace (1774).

and, as Fienberg (2006) notes, "inverse probability" (and, with it, the use of uniform priors) was popular at the turn of the 20th century:

[...] Thus, in retrospect, it shouldn't be surprising to see inverse probability as the method of choice of the great English statisticians of the turn of the century, such as Edgeworth and Pearson. For example, Edgeworth (49) gave one of the earliest derivations of what we now know as Student's $t$-distribution, the posterior distribution of the mean $\mu$ of a normal distribution given uniform prior distributions on $\mu$ and $h =\sigma^{-1}$ [...]

The early history of the Bayesian approach is also reviewed by Stigler (1986) in his book The History of Statistics: The Measurement of Uncertainty before 1900.

In your short review you also do not seem to mention Ronald Aylmer Fisher (again quoting Fienberg, 2006):

Fisher moved away from the inverse methods and towards his own approach to inference he called the "likelihood," a concept he claimed was distinct from probability. But Fisher's progression in this regard was slow. Stigler (164) has pointed out that, in an unpublished manuscript dating from 1916, Fisher didn't distinguish between likelihood and inverse probability with a flat prior, even though when he later made the distinction he claimed to have understood it at this time.

Jaynes (1986) provided his own short review paper, Bayesian Methods: General Background. An Introductory Tutorial, which you could check, but it does not focus on uninformative priors. Moreover, as noted by AdamO, you should definitely read The Epic Story of Maximum Likelihood by Stigler (2007).

It is also worth mentioning that, arguably, there is no such thing as a truly "uninformative" prior, which is why many authors prefer to talk about "vague priors" or "weakly informative priors".

A theoretical review is provided by Kass and Wasserman (1996) in The Selection of Prior Distributions by Formal Rules, who go into greater detail about choosing priors, with an extended discussion of the use of uninformative priors.

Tim
  • That was the kind of answer I was looking for. Thank you! – PhDing Dec 03 '16 at 15:31
  • I think Fienberg stretched the pride of Bayesians too far. I personally strongly dislike using "inverse probability" to define anything, because it does not seem consistent with the integral geometry picture proposed by Adler and Taylor. Any good statistical procedure should have its mathematical correspondence; inverse probability is so twisted that, in my experience, you can hardly analyze it when the problem is slightly more delicate. – Henry.L Dec 05 '16 at 23:26
  • @Henry.L ...nevertheless, it is a part of the history of statistical thought :) Notice also that it's not only Fienberg who provides such examples. The whole anti-inverse-probability and anti-Bayesian rebellion started *because* it became quite popular. – Tim Dec 06 '16 at 13:53
  • @Tim Yeah, I guess that is what Thomas Kuhn called a "paradigm shift", also known as "...opponents eventually die, and a new generation grows up" :)). – Henry.L Dec 07 '16 at 02:11

A few comments about the flaws of noninformative (uninformative) priors are probably a good idea, since the investigation of such flaws helped drive the historical development of the concept.

Among the many criticisms of adopting noninformative priors, I point out two.

(1) Generally, the adoption of noninformative priors raises consistency problems, especially when the model distribution exhibits multi-modal behavior.

This problem is not unique to noninformative priors; it is shared by many other Bayesian procedures, as pointed out in the following paper and the discussions that accompany it.

Diaconis, Persi, and David Freedman. "On the consistency of Bayes estimates." The Annals of Statistics (1986): 1-26.

Nowadays the noninformative prior is no longer a research focus; there seems to be more interest in flexible choices of prior in nonparametric settings. Examples are the Gaussian process prior in nonparametric Bayes procedures, or a flexible model like a mixture of Dirichlet processes, as in

Antoniak, Charles E. "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems." The Annals of Statistics (1974): 1152-1174.

But again, such priors have their own consistency problems.

(2) Most so-called "noninformative priors" are not well-defined.

This is probably the most evident problem associated with noninformative priors during their development.

One example is that defining a noninformative prior as the limit of a sequence of proper priors leads to the marginalization paradox. As you mentioned, Bernardo's reference prior also has the problem that its formal definition was never proved to be independent of its construction/partition. See the discussion in

Berger, James O., José M. Bernardo, and Dongchu Sun. "The formal definition of reference priors." The Annals of Statistics (2009): 905-938.

One well-defined characterization of Jeffreys' prior is that it is the prior invariant under reparameterization, i.e., under parallel translation on the Riemannian manifold equipped with the Fisher information metric; but even that does not solve the first problem.
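
For concreteness (a standard textbook gloss, not specific to the papers cited above): Jeffreys' prior is

$$ \pi(\theta) \propto \sqrt{\det I(\theta)}, \qquad I(\theta) = -\operatorname{E}_\theta\!\left[\frac{\partial^2 \log f(x\mid\theta)}{\partial\theta\,\partial\theta^{\top}}\right], $$

and under any smooth reparameterization $\phi = h(\theta)$, the change-of-variables formula yields a prior of the same form, $\pi(\phi) \propto \sqrt{\det I(\phi)}$, which is the invariance referred to above. For example, for a Bernoulli parameter $\theta$ one gets $I(\theta) = 1/[\theta(1-\theta)]$, so $\pi(\theta) \propto \theta^{-1/2}(1-\theta)^{-1/2}$, i.e. the Beta$(1/2,1/2)$ prior.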

Also, you may want to read my explanation of the marginalization paradox.

Henry.L
  • This is an excellent post and none of us thought about it. Great job. – Dave Harris Dec 05 '16 at 14:36
  • I've made several small edits to expression without trying to change any meaning or implication. Please check that your meaning is invariant under editing. – Nick Cox Dec 06 '16 at 18:45

I would have posted this in the comments, but I guess I do not have the reputation yet. The only missing thing, beyond what is already in the comments, is a special case of noninformative priors whose origins I have tried to hunt down without success. It may precede Jeffreys's paper.

For the normal distribution, I have seen the Cauchy distribution used as a noninformative prior for data with a normal likelihood. The reasoning is that the precision of the Cauchy distribution is zero, where precision is the reciprocal of the variance. This creates a rather peculiar set of contradictory concepts.

The formula for the Cauchy is $$\frac{1}{\pi}\frac{\Gamma}{\Gamma^2+(x-\mu)^2}.$$

Depending on how you define the integral, the variance either is undefined or goes to infinity about the median, which implies the precision goes to zero. In conjugate updating (which would not apply here), you add the weighted precisions. I think this is how the idea of a proper prior with a perfectly imprecise density formed. The Cauchy is also equivalent to Student's t with one degree of freedom, which could also be the source of the idea.

This is a strange idea in the sense that the Cauchy distribution has a well-defined center of location and a well-defined interquartile range, which is $2\Gamma$.

The two earliest references to the Cauchy distribution are as likelihood functions. The first is in a letter from Poisson to Laplace, as an exception to the central limit theorem. The second is in 1851 journal articles, in a battle between Bienaymé and Cauchy over the validity of ordinary least squares.

I have found references to its use as a noninformative prior going back to the 1980s, but I cannot find a first article or book. I also have not found a proof that it is noninformative. I did find a citation to Jeffreys's 1961 book on probability theory, but I have never requested the book via interlibrary loan.

It may simply be weakly informative. The 99.9% highest density region is about 1273 semi-interquartile ranges wide.
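
A minimal numerical check of these two facts (a sketch assuming SciPy is available; `gamma` plays the role of $\Gamma$ above):

```python
# Check that the Cauchy IQR is 2*Gamma and that the central 99.9%
# highest density region is ~1273 semi-interquartile ranges wide.
from scipy.stats import cauchy

gamma = 1.0                                   # scale = semi-interquartile range
q1, q3 = cauchy.ppf([0.25, 0.75], scale=gamma)
print(q3 - q1)                                # 2.0 = 2*gamma

lo, hi = cauchy.ppf([0.0005, 0.9995], scale=gamma)
print((hi - lo) / gamma)                      # ~1273.2
```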

I hope it helps. It is a weird special case, but you see it come up in a number of regression papers. It satisfies the requirements for a Bayes action by being a proper prior, while minimally influencing location and scale.

Dave Harris