"If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions."
Above is an extract from Bishop's book, Pattern Recognition and Machine Learning.
It specifically states that the additional information that must be transmitted, when approximating a distribution $p(x)$ by $q(x)$, is at least equal to the Kullback-Leibler divergence. I understand the equality case, but are there cases where the information to be transmitted can be more than the KL divergence?
An example would be great!
Thank you!
P.S. I'm working with the following definition of KL divergence, as mentioned in the book itself:
Consider some unknown distribution $p(x)$, and suppose that we have modelled this using an approximating distribution $q(x)$. If we use $q(x)$ to construct a coding scheme for the purpose of transmitting values of $x$ to a receiver, then the average additional amount of information (in nats) required to specify the value of $x$ (assuming we choose an efficient coding scheme) as a result of using $q(x)$ instead of the true distribution $p(x)$ is given by $\mathrm{KL}(p\|q)$.
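For concreteness, here is a minimal sketch of how I understand that definition, using a made-up discrete distribution over four symbols (the numbers are purely illustrative): the extra cost is the cross-entropy $H(p, q)$ minus the entropy $H(p)$, which equals $\mathrm{KL}(p\|q)$.

```python
import numpy as np

# Hypothetical discrete distributions over 4 symbols (chosen only for illustration)
p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution p(x)
q = np.array([0.25, 0.25, 0.25, 0.25])    # approximating distribution q(x)

# Average code length (in nats) with a code matched to p: the entropy H(p)
entropy_p = -np.sum(p * np.log(p))

# Average code length (in nats) with a code matched to q, while data come from p: H(p, q)
cross_entropy = -np.sum(p * np.log(q))

# Extra cost of using q instead of p: the KL divergence KL(p || q)
kl = np.sum(p * np.log(p / q))

print(f"H(p)     = {entropy_p:.4f} nats")
print(f"H(p, q)  = {cross_entropy:.4f} nats")
print(f"KL(p||q) = {kl:.4f} nats")  # equals H(p, q) - H(p)
```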
P.P.S. As a follow-up, what exactly did the author mean by a "less efficient coding"? I was wondering if knowing that would get me closer to answering my question.