
"Entropy" roughly captures the degree of "information" in a probability distribution.

For discrete distributions there is a far more precise interpretation: the entropy of a discrete random variable is a lower bound on the expected number of bits required to transmit the outcome of that variable.
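
For concreteness, here is a minimal Python sketch of that claim (the helper functions and the example distribution are my own, chosen just for illustration): the expected length of an optimal prefix code, built here with Huffman's algorithm, never drops below the entropy.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy in bits: H(p) = -sum_i p_i * log2(p_i)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_expected_length(probs):
    """Expected codeword length (bits/symbol) of an optimal Huffman code.

    The expected length equals the sum of the probabilities of the internal
    nodes created while merging, so we never need the actual codewords.
    """
    heap = list(probs)
    heapq.heapify(heap)
    total = 0.0
    while len(heap) > 1:
        a, b = heapq.heappop(heap), heapq.heappop(heap)
        total += a + b
        heapq.heappush(heap, a + b)
    return total

p = [0.6, 0.3, 0.1]
print(entropy_bits(p))              # ~1.295 bits per symbol
print(huffman_expected_length(p))   # 1.4 bits per symbol, never below the entropy
```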

But a continuous random variable has uncountably many possible outcomes, so we cannot even begin to communicate the exact outcome that occurred in a finite string of bits.

What is an equivalent interpretation of entropy for continuous variables?

user56834
  • Do you have any definition of "degree of information" in a probability distribution? – kjetil b halvorsen May 21 '18 at 22:08
  • @kjetilbhalvorsen, I don't see where you're going with this. Isn't the question pretty clear? – user56834 May 22 '18 at 03:28
  • I think good answers are given [here](https://stats.stackexchange.com/a/256238) and [here](https://stats.stackexchange.com/a/245198/21054). – COOLSerdash Jun 03 '18 at 08:38
  • @COOLSerdash perfect. Could you make an answer linking to those two, and I’ll give you the points. – user56834 Jun 03 '18 at 08:43
  • @Programmer2134 I really appreciate it but I don't feel comfortable just posting links without much context (which is discouraged here) and getting points for it. I'm sorry. – COOLSerdash Jun 03 '18 at 19:51

2 Answers


Because of the limiting density of discrete points, the interpretation of $$S = -\sum_x p(x)\ln p(x)$$ cannot be generalized to $$S = -\int dx\, p(x)\ln p(x)$$

This is because the direct generalization leads to $$S = -\int dx\, p(x)\ln (p(x)\,dx) = -\int dx\, p(x)\ln p(x) -\int dx\, p(x)\ln (dx),$$ and the $\ln dx$ term clearly explodes.
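
To see this explosion numerically, here is a small Python sketch of my own (the standard normal example and the bin widths are arbitrary choices): binning the density with width $\Delta$ and computing the discrete entropy of the bin probabilities gives roughly $h(p) - \ln\Delta$, which grows without bound as $\Delta \to 0$.

```python
import numpy as np

def binned_entropy(delta, x_max=10.0):
    """Discrete entropy (nats) of a standard normal binned with width delta."""
    edges = np.arange(-x_max, x_max + delta, delta)
    centers = 0.5 * (edges[:-1] + edges[1:])
    pdf = np.exp(-centers**2 / 2) / np.sqrt(2 * np.pi)
    p = pdf * delta                  # P(X in bin) ~= p(x) dx
    p = p[p > 0]
    return -np.sum(p * np.log(p))

h = 0.5 * np.log(2 * np.pi * np.e)   # differential entropy of N(0, 1), ~1.4189 nats
for delta in [1.0, 0.1, 0.01, 0.001]:
    # the discrete entropy tracks h - ln(delta) and diverges as delta -> 0
    print(delta, binned_entropy(delta), h - np.log(delta))
```

The two printed columns agree, and both grow like $-\ln\Delta$: that growth is exactly the divergent $\ln dx$ term above.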

Intuitively, since $p(x)\,dx \to 0$, the reasoning of using fewer bits to encode outcomes that are more likely to happen no longer holds. So we need another way to interpret $S = -\int dx\, p(x)\ln (p(x)\,dx)$, and the natural choice is the KL divergence.

Say we have a uniform distribution $q(x)$ on the same state space. Then $$KL(p(x)\Vert q(x)) = \int dx\, p(x) \ln \left(\frac{p(x)\,dx}{q(x)\,dx}\right) = \int dx\, p(x) \ln \left(\frac{p(x)}{q(x)}\right).$$ The troublesome $dx$ cancels, so we effectively keep the form of $S = -\int dx\, p(x)\ln (p(x)\,dx)$, and since $q(x)$ is just a constant we obtain a well-defined quantity for the continuous distribution $p(x)$.

So, via the KL divergence, the entropy of a continuous distribution $p(x)$ can be interpreted as:

how many bits are wasted on average if we encode draws from $p(x)$ with a code built for the uniform distribution $q(x)$ rather than for $p(x)$ itself.
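
As a numerical sanity check (again a sketch of my own; Beta(2, 5) and scipy are just convenient choices): for a density $p$ on $[0, 1]$ and the uniform $q$, the KL divergence computed by direct integration equals $\ln(b-a) - h(p)$, which on $[0, 1]$ is simply minus the differential entropy of $p$.

```python
import numpy as np
from scipy.stats import beta
from scipy.integrate import trapezoid

# p = Beta(2, 5) on [0, 1], q = the uniform density on the same interval.
x = np.linspace(1e-6, 1 - 1e-6, 200_000)
p = beta.pdf(x, 2, 5)
q = np.ones_like(x)

kl = trapezoid(p * np.log(p / q), x)   # KL(p || q), in nats

# With a uniform reference on [a, b]: KL(p || q) = ln(b - a) - h(p),
# which on [0, 1] reduces to minus the differential entropy of p.
print(kl, -beta.entropy(2, 5))         # both ~0.48 nats
```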

meTchaikovsky
  • Your last sentence comes to the topic of the question, but it doesn’t actually answer it: what is the interpretation of this “intrinsic property” then, if it is not the number of bits? – user56834 Jun 03 '18 at 07:41
  • The entropy is the expectation of $\ln (1/P)$. It's a matter of mathematical education that people usually prefer to write the definition as you do, to write first $-\ln P$ and then to take the minus sign outside the integral, but your answer needs that correction. – Nick Cox Jun 03 '18 at 07:47
  • @NickCox Thanks for pointing this out, I've edited that. – meTchaikovsky Jun 03 '18 at 08:20
  • @Programmer2134 I've edited my answer, I hope it addresses the question better. – meTchaikovsky Jun 03 '18 at 08:24
  • @Programmer2134 Thanks to your question, I found that I had totally misunderstood the interpretation of $-\int p(x) \ln p(x)$. I've corrected my answer. – meTchaikovsky Jun 03 '18 at 09:33
  • If I have understood this answer correctly, is the last equation effectively saying that information is how non-uniform the distribution is relative to uniform? Intuitively this means information reflects how narrow the shape of the distribution is (so the more restricted the spread of probability density, the higher the information). So basically, could it also be reframed as the signal (1st moment) to noise (2nd moment) ratio? – ReneBt Jun 06 '18 at 09:27
  • @ReneBt For the first question, I think so: the closer a distribution is to a uniform distribution, the smaller the KL divergence. But I don't know whether I can say the sharper the distribution looks, the higher the information, because the nice interpretation of entropy for discrete distributions fails due to the limiting density of discrete points. For the second question, sorry, I'm just a novice at ML and I don't have an idea. – meTchaikovsky Jun 06 '18 at 10:04

You discretize the problem via the probability density. A continuous random variable has a density $f(x)$, which locally approximates the probability: $P(X\in [x,x+\delta x]) \approx f(x)\delta x$, which is now an analogue of the discrete case. And by the theory of calculus, your sums become integrals over your state space.
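
A quick numerical check of that local approximation (a sketch of mine, with scipy used only for the normal pdf and cdf):

```python
from scipy.stats import norm

# Compare P(X in [x, x + delta]) with f(x) * delta for a standard normal.
x, delta = 1.0, 0.01
exact = norm.cdf(x + delta) - norm.cdf(x)
approx = norm.pdf(x) * delta
print(exact, approx)   # the two agree to within about half a percent
```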

Alex R.
  • I may be missing something, but my question was about an interpretation. I know that an integral is a limit of sums. – user56834 May 21 '18 at 18:04
  • This is a very optimistic answer! I believe it is much more complicated. – kjetil b halvorsen May 21 '18 at 22:06
  • @kjetilbhalvorsen: Yea, there are a lot of details shoved under-the-table here. For OP's benefit, see section 2.3.1 of this: https://www.crmarsh.com/static/pdf/Charles_Marsh_Continuous_Entropy.pdf – Alex R. May 21 '18 at 23:13
  • Could you elaborate on this answer? It seems to suggest that you can approximate continuous entropy by discretizing the distribution with small bins. But your link shows this doesn't work, and even says "the formula for continuous entropy is not a derivation of anything" – Jonny Lomond May 30 '18 at 19:55