
This question gives a quantitative definition of cross entropy, in terms of its formula.

I'm looking for a more notional definition. Wikipedia says:

In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, _if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p_.

I have emphasised the part that is giving me trouble in understanding this. I would like a nice definition that doesn't require a separate (pre-existing) understanding of Entropy.

Lyndon White
  • You are asking for a definition of _cross_-entropy that, at the same time, will define _entropy_ itself. And intuitively so... If you have trouble understanding the concept of Entropy itself, it would be a good idea to first understand the basic concept and then any one of its extensions. – Alecos Papadopoulos Jan 01 '14 at 15:44
  • Personally I had a basic understanding of Entropy (though it has been almost 12 months since I applied it). But a quantitative expression of Entropy should fit in one short paragraph, and cross entropy should only take one more. So I feel a good answer can include both, so that the reader doesn't need to refer elsewhere to understand it. – Lyndon White Jan 01 '14 at 16:54
  • See the related posts: https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 and https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758 – kjetil b halvorsen Sep 01 '19 at 11:00

1 Answer


To encode an event occurring with probability $p$, you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
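
For a quick numerical sense of this rule, here is a minimal Python sketch (the probabilities are arbitrary examples, not part of the answer's argument):

```python
import math

# Ideal code length log2(1/p), in bits, for a few example probabilities
for p in (0.5, 0.25, 0.125):
    print(f"p = {p}: {math.log2(1 / p)} bits")  # 1.0, 2.0, 3.0
```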

So with optimal encoding, the average length of an encoded message is $$ \sum_i p_i \log_2(\tfrac{1}{p_i}), $$ that is, the Shannon entropy of the original probability distribution.

However, if for a probability distribution $P$ you use an encoding which is optimal for a different probability distribution $Q$, then the average length of the encoded message is $$ \sum_i p_i \, \text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}), $$ which is the cross entropy; it is greater than $\sum_i p_i \log_2(\tfrac{1}{p_i})$.
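
In code, the two averages can be sketched as follows (a minimal Python illustration; the helper names `entropy` and `cross_entropy` are just for exposition):

```python
import math

def entropy(p):
    """Average code length in bits when events from p use the code optimal for p."""
    return sum(p_i * math.log2(1 / p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """Average code length in bits when events from p use the code optimal for q."""
    return sum(p_i * math.log2(1 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)
```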

As an example, consider an alphabet of four letters (A, B, C, D), with A and B occurring equally often and C and D not appearing at all, so the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.

Then, to encode it optimally, we encode A as 0 and B as 1, so we use one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)

But if we keep the same distribution $P$ yet encode it according to a distribution in which all letters are equally probable, $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, then we use two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
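
Checking the example numerically (again just a sketch; it recomputes the two sums directly):

```python
import math

P = (0.5, 0.5, 0.0, 0.0)       # true letter frequencies
Q = (0.25, 0.25, 0.25, 0.25)   # distribution the code is designed for

# Entropy of P: bits per letter with the code optimal for P
h_p = sum(p * math.log2(1 / p) for p in P if p > 0)

# Cross entropy H(P, Q): bits per letter when P's letters use Q's code
h_pq = sum(p * math.log2(1 / q) for p, q in zip(P, Q) if p > 0)

print(h_p)   # 1.0
print(h_pq)  # 2.0
```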

Piotr Migdal
  • Nice explanation, thanks. However, the Wikipedia definition is $\sum_i[p_i \log(q_i)]$. Your use of $1/q_i$ gives the number of possible states, hence $\log_2$ converts that into the number of bits required to encode a single symbol, but the Wikipedia page is describing something subtly different. – redcalx Jun 26 '15 at 20:14
  • @locster In Wikipedia it has the minus sign before the sum, which is equivalent to having $1/q_i$, as $\log(1/q_i)=-\log(q_i)$. – Piotr Migdal Jun 27 '15 at 09:26