I understand how mutual information is calculated, and what it is addressing: how much the distribution of one variable changes conditional on the value of another variable. But I don't really understand what the output values of a mutual information calculation actually mean in an absolute sense. I know that 0 means the variables are independent, and I know I can use these values in relative comparisons for feature selection without going any deeper than that, but I'd still like to understand what the absolute values mean. For example (using Python with this MI implementation):
$$X \sim U(0,1),\quad n = 10000\\ Y = X + U(0, 0.5)\\ MI(X, Y) \approx 0.92$$
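For reference, here is a minimal sketch of how that number can be reproduced, assuming scikit-learn's `mutual_info_regression` as the estimator (a k-nearest-neighbour based estimator that reports MI in nats); the implementation linked above may of course differ in its exact output:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 10000

x = rng.uniform(0, 1, n)        # X ~ U(0, 1)
y = x + rng.uniform(0, 0.5, n)  # Y = X + U(0, 0.5)

# mutual_info_regression expects a 2D feature matrix and a 1D target,
# and returns one MI estimate (in nats) per feature column.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)
print(mi[0])  # roughly 0.9 nats for this setup
```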
What does it mean that $MI(X, Y) \approx 0.92$ nats? Is this value actually related to the maximum compressibility of the data, or of the relationship between the variables? Or is it something else entirely?