Question: after throwing a die a large number of times and discovering that the average of the outcomes is $4$, what probability distribution should one assign to the statements "the next roll will be $i$" for $i = 1, 2, \dots, 6$?
E.T. Jaynes in Chapter 9 of the book "Probability Theory: The Logic of Science" derives the following:
If we start from a state of prior ignorance, $I_0$, under which the individual rolls are independent and all six outcomes are equally likely ($1/6$ each), then we just need to find the $(p_1, p_2, \dots, p_6)$ that maximises the entropy $H(p_1, p_2, \dots, p_6) := -\sum_{i=1}^6 p_i \log p_i$ subject to $\sum_{i=1}^6 p_i = 1$ and $\sum_{i=1}^6 i\,p_i = 4$.
Utilising Lagrange multipliers one can easily derive the Boltzmann-type distribution for such a die (the derivation is sketched after the numbers below). My posterior distribution, found numerically, is the following:
$$(p_1, p_2, p_3, p_4, p_5, p_6) \approx (0.10,0.12,0.15,0.17,0.21,0.25).$$
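For completeness, here is the Lagrange-multiplier step (my reconstruction of the standard argument, not a quote from Jaynes). Form the Lagrangian

$$\mathcal{L} = -\sum_{i=1}^6 p_i \log p_i - \lambda_0 \Big( \sum_{i=1}^6 p_i - 1 \Big) - \lambda \Big( \sum_{i=1}^6 i\, p_i - 4 \Big);$$

setting $\partial \mathcal{L} / \partial p_i = -\log p_i - 1 - \lambda_0 - \lambda i = 0$ gives

$$p_i = \frac{e^{-\lambda i}}{Z(\lambda)}, \qquad Z(\lambda) = \sum_{j=1}^6 e^{-\lambda j},$$

with $\lambda$ fixed by the mean constraint (here $\lambda < 0$, since the unconstrained maximum-entropy distribution has mean $3.5 < 4$). As a sanity check, the numbers above do satisfy the constraint: $0.10 \cdot 1 + 0.12 \cdot 2 + 0.15 \cdot 3 + 0.17 \cdot 4 + 0.21 \cdot 5 + 0.25 \cdot 6 \approx 4.02 \approx 4$.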
Moreover, E.T. Jaynes argues that such a posterior distribution is the only answer consistent with the prior knowledge $I_0$, the data $D = \{\text{the mean is } 4\}$, and Cox's theorem. However, I have a few questions about this posterior:
1) Qualitative: does it really do what common sense dictates it should? Why doesn't the posterior put more mass on $4$?
2) Why is the mode of the posterior $6$ rather than $4$? Under what loss function should I guess $4$?
3) Why does the MLE approach fail to give the mode of the posterior distribution, despite the following quote?
> A maximum likelihood estimator coincides with the most probable Bayesian estimator given a uniform prior distribution on the parameters. (Wiki: MLE)
P.S. The Haskell code I used to find the answer:

```haskell
-- Candidate values of lambda: -0.001, 0.001, -0.002, 0.002, ...
l = concat $ map (\a -> [(-a), a]) l'
  where
    l' = map (/ 1000) [1 ..]

-- Mean of a distribution over the outcomes 1..6
findExp xs = sum $ zipWith (*) [1 ..] xs

-- Shannon entropy
entropy xs = sum $ map (\a -> -a * log a) xs

-- Boltzmann distribution: p_i proportional to exp (-lam * i), normalised
probs lam = map (/ sum probs') probs'
  where
    probs' = map (\a -> exp (-lam * a)) [1 .. 6]

expectation = 4.0

-- Accept lam once the resulting mean is within 0.05 of the target
condition lam = abs (findExp (probs lam) - expectation) <= 0.05

main = print dist >> print (findExp dist) >> print (entropy dist)
  where
    lambda = head (dropWhile (not . condition) l)
    dist = probs lambda
```
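As an aside, the linear scan over `l` stops at the first $\lambda$ within the rather coarse $0.05$ tolerance. A bisection search converges essentially exactly; here is a minimal sketch, assuming the solution lies in $[-1, 1]$ (it does, since $\lambda = 0$ gives mean $3.5$ and the target mean $4$ needs a negative $\lambda$), and reusing `probs` and `findExp` from above:

```haskell
-- Sketch: bisection for lambda. The mean findExp (probs lam) is strictly
-- decreasing in lam, so a sign change on the bracket guarantees a unique root.
solveLambda :: Double
solveLambda = go (-1) 1
  where
    go lo hi
      | hi - lo < 1e-12           = mid        -- bracket small enough: done
      | findExp (probs mid) > 4.0 = go mid hi  -- mean too high: increase lambda
      | otherwise                 = go lo mid  -- mean too low: decrease lambda
      where
        mid = (lo + hi) / 2
```

Evaluating `probs solveLambda` should reproduce the distribution above to far more digits than the scan.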