From a computer simulation I have built a histogram of the results and normalized it so that $P(X \in b_j)$ is the probability of finding a point $X$ in bin $b_j$, with $\sum_j P(X \in b_j) = 1$. From this I have calculated the histogram's Shannon entropy, $H = -\sum_j P(X \in b_j) \log P(X \in b_j)$, in order to have some way to quantify the "predictivity" of $P$.
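For concreteness, here is a minimal Python sketch of that computation. The normally distributed samples are a hypothetical stand-in for my actual simulation output, and the convention $0 \log 0 = 0$ for empty bins is applied explicitly:

```python
import numpy as np

# Hypothetical stand-in for the simulation output described above.
rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

# Build a histogram and normalize so the bin probabilities sum to 1.
counts, edges = np.histogram(samples, bins=50)
p = counts / counts.sum()

# Shannon entropy in nats; empty bins contribute 0 by the convention
# that p * log(p) -> 0 as p -> 0.
nonzero = p > 0
H = -np.sum(p[nonzero] * np.log(p[nonzero]))
print(f"H = {H:.3f} nats")
```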
Now, while I get a number easily enough, I'm having a hard time understanding what I should do with it. My first thought was to compare $H$ for $P$ against $H$ for the uniform distribution over the same $X$-range, since the uniform distribution has maximal entropy (we know $X$ must lie in a finite range). Alternatively, I could compare the $X$-range to some "effective volume" $\Delta X$, defined as the width of a uniform distribution that has the same $H$ as my histogram. I freely admit these aren't wonderful comparisons, since my histograms don't look at all like uniform distributions.
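Continuing the sketch above (reusing `p`, `edges`, and `H`), both comparisons are short to write down. A uniform distribution over $n$ bins has $H_{\max} = \log n$, and a uniform distribution over $e^H$ bins has entropy exactly $H$, so, assuming equal-width bins and natural logarithms, the effective volume would be $\Delta X = e^H w$ for bin width $w$:

```python
# Maximum possible entropy over the same X-range: uniform over all
# n bins, H_max = log(n). A ratio of 1.0 would mean "fully uniform".
n_bins = len(p)
H_max = np.log(n_bins)
print(f"H / H_max = {H / H_max:.3f}")

# "Effective volume": a uniform distribution over e^H bins has the
# same entropy H, so with equal-width bins the effective X-range is
# e^H times the bin width.
bin_width = edges[1] - edges[0]
delta_X = np.exp(H) * bin_width
total_range = edges[-1] - edges[0]
print(f"effective range {delta_X:.3f} of total {total_range:.3f}")
```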
I work in a field that does not regularly use $H$ as a statistic, so I can't just give my reader a number and be done with it. Still, I believe it is a valuable summary of my histogram. My question is: how would you report, describe, and compare the Shannon entropy of experimental/simulated histograms?