
When it comes to cross entropy, the assumption is that an optimal encoding would use $\log(\frac{1}{p_i})$ bits for a symbol with probability $p_i$. While I understand this from a theoretical standpoint, and this definition is of course useful for optimization & learning, I am hoping to:

  1. See a formal derivation of this intuitive result. That is, we should be able to conclude that $\sum_ip_i\log(1/p_i)$ (i.e. the expected message length) is optimal across all possible encodings for any messages drawn from the underlying distribution.
  2. See a practical example of this type of encoding (perhaps Huffman coding?)

For #2, for example, consider the following symbols and probabilities:

  • p(A) = 0.8, requires 0.32 bits
  • p(B) = 0.1, requires 2.32 bits
  • p(C) = 0.1, requires 2.32 bits

What would the optimal encoding of a message here actually look like in practice?
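For reference, the per-symbol bit counts above are just $\log_2(1/p_i)$; here is the quick check I used (plain Python, nothing library-specific):

```python
import math

p = {"A": 0.8, "B": 0.1, "C": 0.1}
for s, pi in p.items():
    print(s, round(math.log2(1 / pi), 2))   # A: 0.32, B: 2.32, C: 2.32

# Expected message length per symbol (the entropy):
print(round(sum(pi * math.log2(1 / pi) for pi in p.values()), 3))   # ~0.922 bits
```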

    To address your second question: Huffman coding generally will not approach the entropy bound. Two practical methods that will are arithmetic coding and ANS coding. – Brent Kerby Jul 31 '17 at 21:32

1 Answer


See a formal derivation of this intuitive result

This is shown very nicely in the first few chapters of Cover & Thomas's Elements of Information Theory.

The proof is too long for an answer, but it roughly goes as follows. First, a usable code must be uniquely decodable, meaning that every encoded string can be parsed back into symbols in only one way. They then show that prefix-free codes, a strict subset of uniquely-decodable codes, are just as efficient. Prefix-free codes can be represented by trees, and the Kraft inequality on the codeword lengths completes the proof that an average of $\log\left(\frac{1}{p_i}\right)$ bits per symbol is the best achievable.
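To make the Kraft inequality concrete, here is a small sketch of my own using the question's alphabet. The prefix-free code A→0, B→10, C→11 (which is also what Huffman's algorithm produces for these probabilities) has lengths satisfying $\sum_i 2^{-\ell_i} \leq 1$, and its expected length can be compared with the entropy:

```python
import math

# Hypothetical prefix-free code for the question's alphabet.
code = {"A": "0", "B": "10", "C": "11"}
p = {"A": 0.8, "B": 0.1, "C": 0.1}

# Kraft inequality: the codeword lengths l_i of a prefix-free code satisfy
# sum_i 2^(-l_i) <= 1.
kraft_sum = sum(2 ** -len(w) for w in code.values())
print(kraft_sum)  # 0.5 + 0.25 + 0.25 = 1.0, so these lengths are feasible

# Expected code length vs. the entropy of the source.
expected_len = sum(p[s] * len(w) for s, w in code.items())   # 0.8*1 + 0.1*2 + 0.1*2 = 1.2
entropy = sum(p[s] * math.log2(1 / p[s]) for s in p)         # ~0.92 bits/symbol
print(expected_len, entropy)
```

The gap between 1.2 and roughly 0.92 bits per symbol illustrates Brent Kerby's comment above: a symbol-by-symbol Huffman code can stay noticeably above the entropy when one symbol is very likely.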

If you are willing to settle for a less rigorous argument, it can be shown through the Asymptotic Equipartition Property. Assume symbol $i$ is generated independently with probability $p_i$. Taking the negative log of the probability of a length-$n$ sequence, and noting that the log of a product is the sum of the logs, the law of large numbers says that there is a typical set of roughly $2^{nH}$ sequences containing nearly all the probability of the sequences generated by this process.
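As an illustration of that concentration (my own simulation, not part of the original argument), for a long i.i.d. string from the question's source, $-\tfrac{1}{n}\log_2 p(x_1,\dots,x_n)$ lands very close to $H \approx 0.92$ bits:

```python
import math
import random

symbols, weights = ["A", "B", "C"], [0.8, 0.1, 0.1]
p = dict(zip(symbols, weights))
H = sum(q * math.log2(1 / q) for q in weights)      # ~0.922 bits/symbol

n = 10_000
x = random.choices(symbols, weights=weights, k=n)   # one long i.i.d. string
# log of a product = sum of the logs, so
# -(1/n) log2 p(x_1,...,x_n) = -(1/n) sum_i log2 p(x_i)
per_symbol = -sum(math.log2(p[s]) for s in x) / n
print(f"H = {H:.3f}, empirical -log2(p)/n = {per_symbol:.3f}")
```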

Thus a code of average length $H$ bits per symbol is about as efficient as you can get. This is what you'd get if you assigned length $\log\left(\frac{1}{p_i}\right)$ to symbol $i$. You cannot do better by assigning lengths $\log\left(\frac{1}{q_i}\right)$ based on some other set of numbers $q_i$: the $q_i$ would have to form a probability distribution as well (if not, either the code could be made more efficient, or it would violate Kraft). This is Gibbs' inequality: since

$$ \ln x \leq x-1 $$

(which follows formally by differentiating and checking equality at $x = 1$), we have

$$ \sum_i p_i \ln\left(\frac{q_i}{p_i}\right) \leq \sum_i p_i \left(\frac{q_i}{p_i} - 1\right) = \sum_i \left( q_i - p_i \right). $$

The right-hand side, however, is 0, since $p$ and $q$ are both probability distributions. Rearranging gives $\sum_i p_i \log\left(\frac{1}{q_i}\right) \geq \sum_i p_i \log\left(\frac{1}{p_i}\right)$: the expected length under any other assignment of codeword lengths is at least the entropy.
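Numerically (my own check, using the question's distribution and a hypothetical mismatched $q$): coding as if the symbols followed $q$ costs the cross entropy $\sum_i p_i \log_2(1/q_i)$ bits on average, which is never below the entropy.

```python
import math

p = [0.8, 0.1, 0.1]        # true distribution from the question
q = [0.5, 0.25, 0.25]      # hypothetical mismatched distribution (codeword lengths 1, 2, 2)

entropy = sum(pi * math.log2(1 / pi) for pi in p)                      # ~0.922 bits
cross_entropy = sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q))    # 1.2 bits
print(entropy, cross_entropy)   # cross_entropy >= entropy; the gap is KL(p || q)
```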

See a practical example

Given a set of probabilities, you can construct a concrete code using Arithmetic Coding.
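As a rough illustration of what that looks like (a toy sketch of my own, not a production coder): a non-streaming arithmetic coder narrows an interval of $[0, 1)$ once per symbol and then emits just enough bits to identify a point inside the final interval, so a message from the question's source costs close to its information content.

```python
from fractions import Fraction
from math import ceil, log2

# The question's source.
probs = {"A": Fraction(8, 10), "B": Fraction(1, 10), "C": Fraction(1, 10)}

def encode(message, probs):
    """Toy (non-streaming) arithmetic coder: narrow [low, high) once per
    symbol, then emit enough bits to pin down a point in the final interval."""
    # Each symbol owns a sub-interval of [0, 1) given by the cumulative distribution.
    cum, start = {}, Fraction(0)
    for s, pr in probs.items():
        cum[s] = (start, start + pr)
        start += pr
    low, high = Fraction(0), Fraction(1)
    for s in message:
        lo, hi = cum[s]
        low, high = low + (high - low) * lo, low + (high - low) * hi
    # The final width equals the probability of the whole message, so
    # k = ceil(log2(1/width)) bits suffice to name a point inside it.
    k = ceil(-log2(high - low))
    code = int(low * 2**k)
    if Fraction(code, 2**k) < low:   # round up into the interval if needed
        code += 1
    return format(code, f"0{k}b")

msg = "AAAABAAAAC"                   # a sample message, mostly the likely symbol
bits = encode(msg, probs)
print(len(bits), "bits for", len(msg), "symbols:", bits)
# The entropy is about 0.92 bits/symbol, so ~9-10 bits is near the optimum,
# whereas a prefix code with lengths 1, 2, 2 would spend 12 bits on this message.
```

A real arithmetic coder renormalizes the interval and emits bits on the fly so that unbounded-precision fractions are not needed; the sketch keeps exact Fractions only to stay short.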


Looking back at the answer, it seems to me a bit long and rambling. Again, I suggest you go over the first chapters of Cover & Thomas.

Ami Tavory
    Thanks Ami, this is great. I am hoping we can improve the less rigorous perhaps together. One question here: You mention using a message length = $-\log(\frac{1}{p_i})$ (note the negative sign) twice in the text. This would always yield a negative # of bits ($p_i \le 1$). I assume this is a typo? – Amelio Vazquez-Reina Aug 03 '17 at 13:10