14

I implemented the following function to calculate entropy:

from math import log

def calc_entropy(probs):
    # Shannon entropy in bits: -sum(p * log2(p)) over the distribution
    my_sum = 0
    for p in probs:
        if p > 0:  # 0 * log(0) is treated as 0
            my_sum += p * log(p, 2)

    return -my_sum

Result:

>>> calc_entropy([1/7.0, 1/7.0, 5/7.0])
1.1488348542809168
>>> from scipy.stats import entropy # a library function
                                    # gives the same answer
>>> entropy([1/7.0, 1/7.0, 5/7.0], base=2)
1.1488348542809166

My understanding was that entropy is between 0 and 1, with 0 meaning very certain and 1 meaning very uncertain. Why do I get a measure of entropy greater than 1?

I know that if I increase the size of the log base, the entropy measure will be smaller, but I thought base 2 was standard, so I don't think that's the problem.
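
For example, recomputing the same distribution with different bases only rescales the value (a rough check with the scipy function used above; values in the comments are approximate):

from scipy.stats import entropy

probs = [1/7.0, 1/7.0, 5/7.0]
print(entropy(probs, base=2))   # ~1.149 (bits)
print(entropy(probs))           # ~0.796 (nats, natural log)
print(entropy(probs, base=10))  # ~0.346 (bans/hartleys)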

I must be missing something obvious, but what?

Akavall
  • Doesn't the base depend on the kind of entropy? Isn't base 2 Shannon entropy, and natural log statistical mechanics entropy? – Alexis Apr 26 '14 at 03:33
  • @Alexis, but doesn't Shannon's entropy range between 0 and 1? – Akavall Apr 26 '14 at 03:35
  • 1
    No: Shannon entropy is non-negative. – Alexis Apr 26 '14 at 04:23
  • 2
    It seems that there is nothing wrong with entropy being greater than 1 if I have more than two events; the value of entropy is between 0 and 1 only in the special case where my events are binary (I have exactly two events). – Akavall Apr 26 '14 at 23:00

3 Answers

21

Entropy is not the same as probability.

Entropy measures the "information" or "uncertainty" of a random variable. When you are using base 2, it is measured in bits, and there can be more than one bit of information in a variable.

In this example, one sample "contains" about 1.15 bits of information. In other words, if you were able to compress a series of samples perfectly, you would need that many bits per sample, on average.
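
If it helps, here is a rough sketch of where that number comes from, term by term, using the distribution from the question (values in the comments are approximate):

from math import log2

probs = [1/7.0, 1/7.0, 5/7.0]
# each outcome contributes -p * log2(p) bits
terms = [-p * log2(p) for p in probs]
print(terms)       # ~[0.401, 0.401, 0.347]
print(sum(terms))  # ~1.149 bits per sample, matching the result above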

CL.
  • Thank you. I think I get it, but I want to make sure. Is the following statement right? If I only have two outcomes, then the most information I can obtain is 1 bit, but if I have more than 2 outcomes, then I can obtain more than 1 bit of information. – Akavall May 05 '14 at 13:58
  • Yes. (For example, consider four uniformly distributed outcomes, which could be generated by *two* fair coin tosses per sample.) – CL. May 05 '14 at 14:04
  • 2
    To add to this, entropy ranges from 0 to 1 for binary classification problems, and from 0 to $\log_2 k$ in general, where $k$ is the number of classes you have. – MichaelMMeskhi Apr 08 '20 at 17:34
19

The maximum value of entropy is $\log k$, where $k$ is the number of categories you are using. Its numeric value will naturally depend on the base of logarithms you are using.

Using base 2 logarithms as an example, as in the question: $\log_2 1$ is $0$ and $\log_2 2$ is $1$, so a result greater than $1$ is definitely wrong if the number of categories is $1$ or $2$. A value greater than $1$ will be wrong if it exceeds $\log_2 k$.

In view of this it is fairly common to scale entropy by $\log k$, so that results then do fall between $0$ and $1$.
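
A minimal sketch of that scaling (the helper name normalized_entropy is just for illustration; values in the comments are approximate):

from math import log2
from scipy.stats import entropy

def normalized_entropy(probs):
    # entropy in bits divided by its maximum log2(k), so the result is in [0, 1]
    k = len(probs)
    return entropy(probs, base=2) / log2(k)

print(normalized_entropy([1/7.0, 1/7.0, 5/7.0]))     # ~0.725
print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.0 (uniform over k = 4)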

Nick Cox
  • Didn't know about that, thanks. So basically the base of the logarithm has to be the same as the length of the sample, and not more? – Fierce82 Jul 20 '18 at 11:44
  • 2
    The length of the sample is immaterial too. It's how many categories you have. – Nick Cox Jul 20 '18 at 11:49
  • Just to clarify, is $k$ the number of possible categories, or the number of categories you're calculating entropy for? E.g. I have 10 possible categories, but there are 3 samples representing 2 categories in the system I am calculating entropy for. Is $k$ in this case 2? – eljusticiero67 Aug 20 '19 at 18:41
  • Categories that don't occur in practice have observed probability zero and don't affect the entropy result. It's a strong convention, which can be justified more rigorously, that $-0 \log 0$ is to be taken as zero (the base of logarithms being immaterial here). – Nick Cox Aug 20 '19 at 18:47
0

Earlier answers, specifically "Entropy is not the same as probability" and "the maximum value of entropy is $\log k$", are both correct.

As stated earlier, entropy measures the "information" or "uncertainty" of a random variable. Information can be measured in bits, and when doing so $\log_2$ should be used. However, if a different information unit is used, the numeric value changes simply because that unit can encode more information per symbol. As an example, 1 bit can encode two events (0 and 1), while 1 ban can encode 10 different events; it follows that 1 ban $\approx$ 3.322 bits (compare: 3 bits = 8 events).
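
A rough sketch of that conversion, reusing the distribution from the question (values in the comments are approximate):

from math import log2
from scipy.stats import entropy

probs = [1/7.0, 1/7.0, 5/7.0]
h_bits = entropy(probs, base=2)   # ~1.149 bits
h_bans = entropy(probs, base=10)  # ~0.346 bans
print(h_bits / h_bans)            # ~3.322, i.e. log2(10) bits per ban
print(log2(10))                   # ~3.322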

In summary, there is really no difference between entropy values between 0 and 1 and values greater than 1, as long as you use the same entropy unit across comparisons. However, for some applications (e.g. cross-entropy loss) a value between 0 and 1 may be more convenient.

Juli