
This question gives a quantitative definition of cross entropy, in terms of its formula.

I'm looking for a more notional definition. Wikipedia says:

In information theory, the cross entropy between two probability distributions measures the average number of bits needed to identify an event from a set of possibilities, _if a coding scheme is used based on a given probability distribution q, rather than the "true" distribution p_.

I have emphasised the part that is giving me trouble in understanding this. I would like a nice definition that doesn't require a separate (pre-existing) understanding of Entropy.

Lyndon White
  • You are asking for a definition of _cross_-entropy that, at the same time, will define _entropy_ itself. And intuitively so... If you have trouble understanding the concept of Entropy itself, it would be a good idea to first understand the basic concept and then any one of its extensions. – Alecos Papadopoulos Jan 01 '14 at 15:44
  • Personally I had a basic understanding of Entropy (though it has been almost 12 months since I applied it). But a quantitative expression of Entropy should fit in one short paragraph, and cross entropy should only take one more. So I feel a good answer can include both, so that the reader doesn't need to refer elsewhere to understand it. – Lyndon White Jan 01 '14 at 16:54
  • See the related posts: https://stats.stackexchange.com/questions/66186/statistical-interpretation-of-maximum-entropy-distribution/245198#245198 and https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758 – kjetil b halvorsen Sep 01 '19 at 11:00

1 Answer


To encode an event occurring with probability $p$, you need at least $\log_2(1/p)$ bits (why? see my answer on "What is the role of the logarithm in Shannon's entropy?").
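
For a quick numerical sense of this rule, here is a minimal Python sketch (the probabilities are arbitrary examples, not part of the answer's argument):

```python
import math

# Ideal code length log2(1/p), in bits, for a few example probabilities
for p in (0.5, 0.25, 0.125):
    print(f"p = {p}: {math.log2(1 / p)} bits")  # 1.0, 2.0, 3.0
```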

So with optimal encoding, the average length of an encoded message is $$ \sum_i p_i \log_2(\tfrac{1}{p_i}), $$ that is, the Shannon entropy of the original probability distribution.

However, if for a probability distribution $P$ you use an encoding which is optimal for a different probability distribution $Q$, then the average length of the encoded message is $$ \sum_i p_i \, \text{code\_length}(i) = \sum_i p_i \log_2(\tfrac{1}{q_i}), $$ which is the cross entropy; it is greater than $\sum_i p_i \log_2(\tfrac{1}{p_i})$.
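
In code, the two averages can be sketched as follows (a minimal Python illustration; the helper names `entropy` and `cross_entropy` are just for exposition):

```python
import math

def entropy(p):
    """Average code length in bits when events from p use the code optimal for p."""
    return sum(p_i * math.log2(1 / p_i) for p_i in p if p_i > 0)

def cross_entropy(p, q):
    """Average code length in bits when events from p use the code optimal for q."""
    return sum(p_i * math.log2(1 / q_i) for p_i, q_i in zip(p, q) if p_i > 0)
```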

As an example, consider an alphabet of four letters (A, B, C, D), with A and B occurring equally often and C and D not appearing at all, so the probability distribution is $P=(\tfrac{1}{2}, \tfrac{1}{2}, 0, 0)$.

Then, to encode it optimally, we encode A as 0 and B as 1, so we use one bit of encoded message per letter. (This is exactly the Shannon entropy of our probability distribution.)

But if we keep the same distribution $P$ yet encode it according to a distribution in which all letters are equally probable, $Q=(\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4},\tfrac{1}{4})$, then we use two bits per letter (for example, we encode A as 00, B as 01, C as 10 and D as 11).
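
Checking the example numerically (again just a sketch; it recomputes the two sums directly):

```python
import math

P = (0.5, 0.5, 0.0, 0.0)       # true letter frequencies
Q = (0.25, 0.25, 0.25, 0.25)   # distribution the code is designed for

# Entropy of P: bits per letter with the code optimal for P
h_p = sum(p * math.log2(1 / p) for p in P if p > 0)

# Cross entropy H(P, Q): bits per letter when P's letters use Q's code
h_pq = sum(p * math.log2(1 / q) for p, q in zip(P, Q) if p > 0)

print(h_p)   # 1.0
print(h_pq)  # 2.0
```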

Piotr Migdal
  • Nice explanation, thanks. However, the Wikipedia definition is $\sum_i[p_i \log(q_i)]$. Your use of $1/q_i$ gives the number of possible states, hence $\log_2$ converts that into the number of bits required to encode a single symbol, but the Wikipedia page is describing something subtly different. – redcalx Jun 26 '15 at 20:14
  • @locster In Wikipedia it has the minus sign before the sum, which is equivalent to having $1/q_i$, as $\log(1/q_i)=-\log(q_i)$. – Piotr Migdal Jun 27 '15 at 09:26