"If we use a distribution that is different from the true one, then we must necessarily have a less efficient coding, and on average the additional information that must be transmitted is (at least) equal to the Kullback-Leibler divergence between the two distributions."
Above is an extract from Bishop's book, Pattern Recognition and Machine Learning.
It specifically states that the additional information that must be transmitted, when approximating a distribution $p(x)$ by $q(x)$, is at least equal to the Kullback-Leibler divergence. I understand the equality case, but are there cases where the information to be transmitted can be more than the KL divergence?
An example would be great!
Thank you!
P.S. I'm working with the following definition of KL divergence, as mentioned in the book itself:
Consider some unknown distribution $p(x)$, and suppose that we have modelled this using an approximating distribution $q(x)$. If we use $q(x)$ to construct a coding scheme for the purpose of transmitting values of $x$ to a receiver, then the average additional amount of information (in nats) required to specify the value of $x$ (assuming we choose an efficient coding scheme) as a result of using $q(x)$ instead of the true distribution $p(x)$ is given by $\mathrm{KL}(p\|q)$.
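For concreteness, here is a minimal sketch of how I understand that definition, using a made-up discrete distribution over four symbols (the numbers are purely illustrative): the extra cost is the cross-entropy $H(p, q)$ minus the entropy $H(p)$, which equals $\mathrm{KL}(p\|q)$.

```python
import numpy as np

# Hypothetical discrete distributions over 4 symbols (chosen only for illustration)
p = np.array([0.5, 0.25, 0.125, 0.125])   # true distribution p(x)
q = np.array([0.25, 0.25, 0.25, 0.25])    # approximating distribution q(x)

# Average code length (in nats) with a code matched to p: the entropy H(p)
entropy_p = -np.sum(p * np.log(p))

# Average code length (in nats) with a code matched to q, while data come from p: H(p, q)
cross_entropy = -np.sum(p * np.log(q))

# Extra cost of using q instead of p: the KL divergence KL(p || q)
kl = np.sum(p * np.log(p / q))

print(f"H(p)     = {entropy_p:.4f} nats")
print(f"H(p, q)  = {cross_entropy:.4f} nats")
print(f"KL(p||q) = {kl:.4f} nats")  # equals H(p, q) - H(p)
```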
P.P.S. As a follow-up, what exactly did the author mean by a "less efficient coding"? I was wondering if knowing that would get me closer to answering my question.