
It is said that the distribution with the largest entropy should be chosen as the least-informative default. That is, we should choose the distribution that maximizes entropy because it has the lowest information content, allowing us to be maximally surprised. Surprise, therefore, is synonymous with uncertainty.

Why do we want that, though? Isn't the point of statistics to estimate with minimal error or uncertainty? Don't we want to extract the most information we can from a dataset/random variable and its distribution?

develarist
  • Hi: It's the least informative beforehand in the sense of making the fewest assumptions about the values of the distribution while still being subject to some constraint (e.g., the second moment being equal to $\sigma^2$). It's explained pretty nicely in the link below. The idea is that, BEFOREHAND, you don't want to make any assumptions about the information in your sample. The distribution with the greatest entropy gives you that characterization. https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution – mlofton Jul 29 '20 at 11:52
  • OK, at what point does max entropy become useful, though, in the sense of minimizing uncertainty in characterizing a dataset? – develarist Jul 29 '20 at 12:06
  • Maybe my answer at https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 can help? – kjetil b halvorsen Jul 29 '20 at 20:04
  • I saw that before; it doesn't. – develarist Jul 29 '20 at 20:49
  • Who said *the distribution with the largest entropy should be chosen as the least-informative default* and in which context? – Richard Hardy Aug 20 '20 at 14:58
  • The maximum entropy principle says it in the first paragraph of the following link: "least-informative default" in the sense that maximum entropy (the uniform distribution) is the most ignorant setting, the one that makes the fewest distributional assumptions (equally weighted probabilities). https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution – develarist Oct 07 '20 at 20:32
  • @mlofton I have a follow-up question: https://math.stackexchange.com/questions/3855776/if-a-zero-entropy-distribution-implies-high-information-a-priori-what-does-it-m – develarist Oct 08 '20 at 15:10

1 Answer


Because "maxent" distribution is more "in the center". A formal description of this is in this paper -- "Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory". The basic idea if when you know some constraint is true, you better pick the maximum entropy distribution subject to this constraint, because it guarantees that you won't be too far from the worst-case true distribution (which could be hiding in the corner)

Here's an example -- the space of all distributions over 3 outcomes, with entropy contours:

[Figure: the simplex of distributions over 3 outcomes, with entropy contours]
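If you want to reproduce a figure like this, here's a small sketch of my own (not the code behind the original plot), assuming NumPy and matplotlib: it grids the simplex of 3-outcome distributions and contours the entropy.

```python
# Sketch: entropy contours over the simplex of distributions (p1, p2, p3) on 3 outcomes.
import numpy as np
import matplotlib.pyplot as plt

n = 150
xs, ys, hs = [], [], []
for i in range(n + 1):
    for j in range(n + 1 - i):
        p = np.array([i, j, n - i - j]) / n
        q = p[p > 0]
        h = -(q * np.log(q)).sum()  # Shannon entropy, with 0*log(0) treated as 0
        # Map barycentric (p1, p2, p3) to 2D; vertices at (0,0), (1,0), (0.5, sqrt(3)/2).
        xs.append(p[1] + 0.5 * p[2])
        ys.append(np.sqrt(3) / 2 * p[2])
        hs.append(h)

plt.tricontourf(xs, ys, hs, levels=20)
plt.colorbar(label="entropy")
plt.gca().set_aspect("equal")
plt.title("Entropy over distributions on 3 outcomes")
plt.show()
```

The maximum sits at the centroid (1/3, 1/3, 1/3), and entropy falls off toward the vertices, which are the deterministic distributions.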

And here's the plot of entropy for all distributions. Picking the highest-entropy distribution gives you the one closest to the center, which also minimizes the distance (in the KL-divergence sense) to the furthest point (i.e., the potential true distribution).

[Figure: entropy plotted over the space of distributions on 3 outcomes; the maximum is at the center, the uniform distribution]
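The worst-case claim can be checked numerically. Here's a brute-force sketch of mine (the skewed alternative is an arbitrary choice): for each candidate p it scans a grid of possible true distributions q and records the largest KL(q || p).

```python
# Sketch: the uniform (maxent) distribution on 3 outcomes minimizes the
# worst-case KL divergence max_q KL(q || p) over possible true distributions q.
import numpy as np

def kl(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def worst_case_kl(p, n=60):
    worst = 0.0
    for i in range(n + 1):               # brute-force grid over the simplex
        for j in range(n + 1 - i):
            q = np.array([i, j, n - i - j]) / n
            worst = max(worst, kl(q, p))
    return worst

uniform = np.array([1, 1, 1]) / 3
skewed  = np.array([0.6, 0.3, 0.1])      # arbitrary lower-entropy alternative
print(worst_case_kl(uniform))            # ~log 3  ≈ 1.10
print(worst_case_kl(skewed))             # ~log 10 ≈ 2.30, much worse
```

The worst case always sits at a vertex of the simplex, so a skewed choice pays a large penalty when the true distribution is hiding in the corner it has nearly ruled out.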

One could visualize this in the original space, with $p_1, p_2, p_3$ being the 3 axes of the space of multinomial distributions over 3 outcomes:

[Figure: the same entropy plot in the original coordinates, with axes $p_1, p_2, p_3$; valid distributions lie on the 2-dimensional slice $p_1 + p_2 + p_3 = 1$]

Yaroslav Bulatov
  • Is there a correspondence or some sort of corollary between the "center of a distribution" and the distribution's sample mean? – develarist Oct 07 '20 at 21:03
    "The center" here refers to center of the space of distributions. "Sample mean" is a moment, and there's a relationship between moments of distribution and being "close to the center" (equivalently, being "high entropy"). For instance E[XY]=E[X]E[Y] is one such relationship, your distribution has much higher entropy when this constraint is satisfied – Yaroslav Bulatov Oct 07 '20 at 21:06
  • Looking at the suggested paper, I didn't find the two images displayed in your answer. Where did you get them? I'd like to confirm what the meaning and units of the 3 vertices and 3 axes are, and how the number of random variables used to form them can be inferred. – develarist Oct 07 '20 at 21:10
  • I made them myself. Think of it as the collection of all possible weighted 3-sided dice. The 3 vertices are the 3 extremal distributions, (1,0,0), (0,1,0), (0,0,1), i.e. the die lands on one of the faces with probability 1. The middle point is the die where all sides are equally likely. Set of dice = set of multinomial distributions. – Yaroslav Bulatov Oct 07 '20 at 21:17
  • So the $z$-axis (vertical) is entropy? Ticks between the vertices would help, since I don't see how the dice outcomes mix across the 3 'variables', nor how entropy is being calculated at each 'grid point'. – develarist Oct 07 '20 at 21:22
  • does the "worst case true distribution" have low or high information (relative to the maxent distribution)? – develarist Oct 07 '20 at 21:26
  • The worst-case distribution will have low entropy. Ticks are kind of hard because the space of distributions over a 3-sided die has 3 parameters. But they add up to 1, so only a 2-dimensional slice of that space contains valid distributions. https://stackoverflow.com/questions/5097637/how-can-i-plot-a-function-defined-on-the-unit-simplex-in-mathematica – Yaroslav Bulatov Oct 07 '20 at 21:34
  • I've added a plot with 3 axes corresponding to the probabilities of the die faces; hopefully it's clearer. – Yaroslav Bulatov Oct 07 '20 at 21:58
  • Any examples of doing the same for continuous random variables (>3 outcomes, histogram estimation of entropy)? – develarist Oct 08 '20 at 03:39
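Regarding the comment above about E[XY]=E[X]E[Y]: here's a tiny sketch of my own (not from the answer) checking that, with the marginals held fixed, the joint that satisfies this constraint (the independent/product joint) has higher entropy than a correlated joint with the same marginals.

```python
# Sketch: for fixed marginals, the independent joint (E[XY] = E[X]E[Y]) has
# higher entropy than a correlated joint with the same marginals.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

px = np.array([0.5, 0.5])        # marginal of X over {0, 1}
py = np.array([0.5, 0.5])        # marginal of Y over {0, 1}

independent = np.outer(px, py)   # product joint: the constraint holds
correlated = np.array([[0.4, 0.1],
                       [0.1, 0.4]])  # same marginals, but X and Y are correlated

print(entropy(independent.ravel()))  # log 4 ≈ 1.386, the maximum possible
print(entropy(correlated.ravel()))   # ≈ 1.19, lower entropy
```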