I saw the following explanation of entropy in probability:
(Entropy). The surprise of learning that an event with probability $p$ happened is defined as $\log_2(1/p)$, measured in a unit called bits. Low-probability events have high surprise, while an event with probability $1$ has zero surprise. The $\log$ is there so that if we observe two independent events $A$ and $B$, the total surprise is the same as the surprise from observing $A \cap B$. The $\log$ is base $2$ so that if we learn that an event with probability $1/2$ happened, the surprise is $1$, which corresponds to having received $1$ bit of information.
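To make sure I understand the definition, here is a small check I ran of the properties claimed above (the particular probabilities are just ones I picked):

```python
import math

def surprise(p):
    """Surprise of learning that an event with probability p occurred, in bits."""
    return math.log2(1 / p)

# A certain event carries zero surprise.
print(surprise(1))      # 0.0

# An event with probability 1/2 carries exactly 1 bit of surprise.
print(surprise(0.5))    # 1.0

# Additivity: for independent events A (p=1/2) and B (p=1/4),
# the surprise of A ∩ B (p=1/8) equals the sum of the individual surprises.
print(surprise(0.5 * 0.25))                 # 3.0
print(surprise(0.5) + surprise(0.25))       # 3.0
```

This matches the description: low-probability events are more surprising, and the base-2 logarithm turns independent probabilities (which multiply) into surprises (which add).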
I then read this answer by user "mitchus".
Given these two descriptions, I am still unable to dispel an aspect of my confusion, and the more I think about it, the more confused I become. If entropy is the "surprise" of learning that an event with probability $p$ happened, then wouldn't the distribution with the highest entropy be the one with the most possible outcomes spread over the largest range, so that there are many outcomes, each of which has a very low probability of occurring? Or does this actually describe a uniform distribution? Thank you.
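Edit: to make my question concrete, here is a small numerical check I ran, using the entropy $H = \sum_i p_i \log_2(1/p_i)$ (i.e. the expected surprise) and two distributions I made up, both over four outcomes:

```python
import math

def entropy(probs):
    """Entropy in bits: the expected surprise over the distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Uniform over 4 outcomes: every outcome has probability 1/4.
uniform = [0.25, 0.25, 0.25, 0.25]

# Skewed over the same 4 outcomes: one likely outcome, three rare ones.
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # 2.0 bits, i.e. log2(4)
print(entropy(skewed))   # about 1.36 bits, strictly less
```

The uniform distribution comes out higher even though the skewed one contains individually rarer (more surprising) outcomes, which is exactly the tension I am asking about.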