
It is said that the distribution with the largest entropy should be chosen as the least-informative default. That is, we should choose the distribution that maximizes entropy because it has the lowest information content, allowing us to be maximally surprised. Surprise, therefore, is synonymous with uncertainty.

Why do we want that, though? Isn't the point of statistics to estimate with minimal error or uncertainty? Don't we want to extract the most information we can from a dataset/random variable and its distribution?

develarist
  • Hi: It's the least informative beforehand in the sense of making the fewest assumptions about the values of the distribution while still being subject to some constraint (e.g., the second moment being equal to $\sigma^2$). It's explained pretty nicely in the link below. The idea is that, BEFOREHAND, you don't want to make any assumptions about the information in your sample. The distribution with the greatest entropy gives you that characterization. https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution – mlofton Jul 29 '20 at 11:52
  • OK, at what point does max entropy become useful, though, in the sense of minimizing uncertainty in characterizing a dataset? – develarist Jul 29 '20 at 12:06
  • Maybe my answer at https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 can help? – kjetil b halvorsen Jul 29 '20 at 20:04
  • I saw that before; it doesn't. – develarist Jul 29 '20 at 20:49
  • Who said *the distribution with the largest entropy should be chosen as the least-informative default* and in which context? – Richard Hardy Aug 20 '20 at 14:58
  • The maximum entropy principle says it in the first paragraph of the following link: "least-informative default" in the sense that maximum entropy (the uniform distribution) is the most ignorant setting, the one that makes the fewest distributional assumptions (equally weighted probabilities). https://en.wikipedia.org/wiki/Maximum_entropy_probability_distribution – develarist Oct 07 '20 at 20:32
  • @mlofton I have a follow-up question: https://math.stackexchange.com/questions/3855776/if-a-zero-entropy-distribution-implies-high-information-a-priori-what-does-it-m – develarist Oct 08 '20 at 15:10

1 Answer


Because "maxent" distribution is more "in the center". A formal description of this is in this paper -- "Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory". The basic idea if when you know some constraint is true, you better pick the maximum entropy distribution subject to this constraint, because it guarantees that you won't be too far from the worst-case true distribution (which could be hiding in the corner)

Here's an example -- the space of all distributions over 3 outcomes, with entropy contours:

[Figure: the simplex of distributions over 3 outcomes, with entropy contours]
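If you want to reproduce a figure like this, here's a small sketch of my own (not the code behind the original plot), assuming NumPy and matplotlib: it grids the simplex of 3-outcome distributions and contours the entropy.

```python
# Sketch: entropy contours over the simplex of distributions (p1, p2, p3) on 3 outcomes.
import numpy as np
import matplotlib.pyplot as plt

n = 150
xs, ys, hs = [], [], []
for i in range(n + 1):
    for j in range(n + 1 - i):
        p = np.array([i, j, n - i - j]) / n
        q = p[p > 0]
        h = -(q * np.log(q)).sum()  # Shannon entropy, with 0*log(0) treated as 0
        # Map barycentric (p1, p2, p3) to 2D; vertices at (0,0), (1,0), (0.5, sqrt(3)/2).
        xs.append(p[1] + 0.5 * p[2])
        ys.append(np.sqrt(3) / 2 * p[2])
        hs.append(h)

plt.tricontourf(xs, ys, hs, levels=20)
plt.colorbar(label="entropy")
plt.gca().set_aspect("equal")
plt.title("Entropy over distributions on 3 outcomes")
plt.show()
```

The maximum sits at the centroid (1/3, 1/3, 1/3), and entropy falls off toward the vertices, which are the deterministic distributions.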

And here's the plot of entropy for all distributions. Picking the highest-entropy distribution gives you the one closest to the center, which also minimizes the distance (in the KL-divergence sense) to the furthest point (i.e., the potential true distribution).

[Figure: entropy plotted over the space of distributions on 3 outcomes; the maximum is at the center, the uniform distribution]
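The worst-case claim can be checked numerically. Here's a brute-force sketch of mine (the skewed alternative is an arbitrary choice): for each candidate p it scans a grid of possible true distributions q and records the largest KL(q || p).

```python
# Sketch: the uniform (maxent) distribution on 3 outcomes minimizes the
# worst-case KL divergence max_q KL(q || p) over possible true distributions q.
import numpy as np

def kl(q, p):
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def worst_case_kl(p, n=60):
    worst = 0.0
    for i in range(n + 1):               # brute-force grid over the simplex
        for j in range(n + 1 - i):
            q = np.array([i, j, n - i - j]) / n
            worst = max(worst, kl(q, p))
    return worst

uniform = np.array([1, 1, 1]) / 3
skewed  = np.array([0.6, 0.3, 0.1])      # arbitrary lower-entropy alternative
print(worst_case_kl(uniform))            # ~log 3  ≈ 1.10
print(worst_case_kl(skewed))             # ~log 10 ≈ 2.30, much worse
```

The worst case always sits at a vertex of the simplex, so a skewed choice pays a large penalty when the true distribution is hiding in the corner it has nearly ruled out.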

One could visualize this in the original space, with $p_1, p_2, p_3$ being the 3 axes of the space of multinomial distributions over 3 outcomes:

[Figure: the same entropy plot in the original coordinates, with axes $p_1, p_2, p_3$; valid distributions lie on the 2-dimensional slice $p_1 + p_2 + p_3 = 1$]

Yaroslav Bulatov
  • Is there a correspondence or some sort of corollary between the "center of a distribution" and the distribution's sample mean? – develarist Oct 07 '20 at 21:03
    "The center" here refers to center of the space of distributions. "Sample mean" is a moment, and there's a relationship between moments of distribution and being "close to the center" (equivalently, being "high entropy"). For instance E[XY]=E[X]E[Y] is one such relationship, your distribution has much higher entropy when this constraint is satisfied – Yaroslav Bulatov Oct 07 '20 at 21:06
  • Looking at the suggested paper, I didn't find the two images displayed in your answer. Where did you get them? I'd like to confirm what the meaning and units of the 3 vertices and 3 axes are, and how the number of random variables used to form them can be inferred. – develarist Oct 07 '20 at 21:10
  • I made them myself. Think of it as the collection of all possible weighted 3-sided dice. The 3 vertices are the 3 extremal distributions, (1,0,0), (0,1,0), (0,0,1), i.e. the die lands on one of the faces with probability 1. The middle point is the die where all sides are equally likely. Set of dice = set of multinomial distributions. – Yaroslav Bulatov Oct 07 '20 at 21:17
  • So the $z$-axis (vertical) is entropy? Ticks between the vertices would help, since I don't see how the dice outcomes mix across the 3 'variables', nor how entropy is being calculated at each 'grid point'. – develarist Oct 07 '20 at 21:22
  • does the "worst case true distribution" have low or high information (relative to the maxent distribution)? – develarist Oct 07 '20 at 21:26
  • The worst-case distribution will have low entropy. Ticks are kind of hard because the space of distributions over a 3-sided die has 3 parameters. But they add up to 1, so only a 2-dimensional slice of that space contains valid distributions. https://stackoverflow.com/questions/5097637/how-can-i-plot-a-function-defined-on-the-unit-simplex-in-mathematica – Yaroslav Bulatov Oct 07 '20 at 21:34
  • I've added a plot with 3 axes corresponding to the probabilities of the die faces; hopefully it's clearer. – Yaroslav Bulatov Oct 07 '20 at 21:58
  • Any examples of doing the same for continuous random variables (>3 outcomes, histogram estimation of entropy)? – develarist Oct 08 '20 at 03:39
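Regarding the comment above about E[XY]=E[X]E[Y]: here's a tiny sketch of my own (not from the answer) checking that, with the marginals held fixed, the joint that satisfies this constraint (the independent/product joint) has higher entropy than a correlated joint with the same marginals.

```python
# Sketch: for fixed marginals, the independent joint (E[XY] = E[X]E[Y]) has
# higher entropy than a correlated joint with the same marginals.
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log(p)).sum()

px = np.array([0.5, 0.5])        # marginal of X over {0, 1}
py = np.array([0.5, 0.5])        # marginal of Y over {0, 1}

independent = np.outer(px, py)   # product joint: the constraint holds
correlated = np.array([[0.4, 0.1],
                       [0.1, 0.4]])  # same marginals, but X and Y are correlated

print(entropy(independent.ravel()))  # log 4 ≈ 1.386, the maximum possible
print(entropy(correlated.ravel()))   # ≈ 1.19, lower entropy
```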