
In Bishop's Pattern Recognition and Machine Learning, there is a small discussion in section 10.1.2 of the difference between minimizing $D_{KL}(p \:||\: q)$ and $D_{KL}(q \:||\: p)$ with respect to the parameters of $q$, where $p$ is a known distribution. Specifically, the book states that minimizing $D_{KL}(p \:||\: q)$ results in a broad distribution for $q$ that averages across multiple modes of $p$, while minimizing $D_{KL}(q \:||\: p)$ results in a $q$ that is concentrated on a single mode of $p$. Intuitively, I see averaging across multiple modes as resulting in a $q$ which has greater entropy than one which is concentrated on a single mode. The figure below from the book summarizes this:

[Figure from Bishop (PRML, section 10.1.2) comparing the fits of $q$ obtained by minimizing the two forms of KL divergence when $p$ is multimodal.]

However, this seems to contradict the mathematics, since $$ D_{KL}(p \:||\: q) = H(p,q) - H(p) \tag{1}\label{eq:klpq} $$ and $$ D_{KL}(q \:||\: p) = H(q,p) - H(q) \tag{2}\label{eq:klqp} $$ as given here. Therefore, mathematically, minimizing $D_{KL}(q \:||\: p)$ is accomplished in part by encouraging larger values for the entropy of $q$, while minimizing $D_{KL}(p \:||\: q)$ effectively ignores the entropy of $q$.
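For concreteness, identities (1) and (2) are easy to check numerically; here is a small sketch in Python (the discrete distributions `p` and `q` below are arbitrary choices, just for illustration):

```python
# Quick numerical check of identities (1) and (2) for two arbitrary
# discrete distributions (NumPy only; everything here is just an example).
import numpy as np

p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

def entropy(r):
    """Shannon entropy H(r) in nats."""
    return -np.sum(r * np.log(r))

def cross_entropy(r, s):
    """Cross-entropy H(r, s) = -E_r[log s] in nats."""
    return -np.sum(r * np.log(s))

def kl(r, s):
    """KL divergence D_KL(r || s) in nats."""
    return np.sum(r * np.log(r / s))

print(np.isclose(kl(p, q), cross_entropy(p, q) - entropy(p)))  # True, eq. (1)
print(np.isclose(kl(q, p), cross_entropy(q, p) - entropy(q)))  # True, eq. (2)
```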

Shouldn't the entropy of $q$ be greater when minimizing $D_{KL}(q \:||\: p)$ than when minimizing $D_{KL}(p \:||\: q)$ (each w.r.t. parameters of $q$)?

asked by Vivek Subramanian, edited by kjetil b halvorsen
  • Look at the citations in my answer at https://stats.stackexchange.com/questions/87182/what-is-the-role-of-the-logarithm-in-shannons-entropy/463828#463828 – kjetil b halvorsen May 09 '20 at 21:09
  • Your post isn't entirely clear: 1) Your definition of $\text{D}_\text{KL}$ seems to be inverted compared to the one used at https://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758. Also, when you say minimize $\text{D}_\text{KL}(p || q)$, is it with respect to $p$ or $q$? – kjetil b halvorsen May 09 '20 at 21:48
  • Thanks for your comments. I was using the definition given in [Wikipedia](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence#Motivation). I meant minimizing with respect to parameters of $q$. – Vivek Subramanian May 09 '20 at 21:53
  • Can you please add that to the main post? – kjetil b halvorsen May 09 '20 at 21:54

2 Answers


I will reformulate using my answer at Intuition on the Kullback-Leibler (KL) Divergence. Since I am a statistician, I am more comfortable with likelihoods than with entropies, but I also think that gives more intuition here. Entropy is $ \DeclareMathOperator{\E}{\mathbb{E}} H(X) = -\sum_x p(x) \log p(x) = -\E_p \log p$, so entropy is simply the negative expected loglikelihood. Similarly, the cross-entropy is the negative expected loglikelihood, but with the expectation taken not under the distribution itself ("its own truth") but under some other distribution, some other truth. We can use this to express the divergences as $$ \text{D}_\text{KL}(p || q) = \E_p \log p - \E_p \log q \label{*} \tag{*} $$ and $$ \text{D}_\text{KL}(q || p) = \E_q \log q - \E_q \log p. \label{**} \tag{**} $$

Using the interpretation from the above linked post, $\text{D}_\text{KL}(p || q)$ is the expectation, under the alternative, of the log likelihood ratio for testing $H_0\colon q$ against the alternative $p$. In your setting, $p$ is a known distribution, and we suppose it is multimodal. In case $\eqref{*}$, $q$ is the null distribution, and the divergence is small for a $q$ that is difficult to reject against the alternative $p$ when the expectation is calculated under $p$. But that means all the modes of $p$ contribute to the expectation, so it is better for $q$ to have some mass in all those modes. This minimization must therefore give distributions that have some probability everywhere $p$ has it.

Then turn to $\eqref{**}$. Now $p$ is the null hypothesis and the divergence is small when it is difficult to reject $p$ against the alternative $q$, and the expectation is calculated under $q$. So in this case, if $q$ omits some of the modes of $p$, that does not matter, because the large likelihood ratio there does not contribute to the expectation! So the conclusion in the book is indeed correct.
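Here is a small numerical sketch of that behaviour (not from the book: $p$ is an arbitrary bimodal Gaussian mixture, $q$ a single Gaussian, and the grid and optimizer settings are just illustrative choices). The forward-KL fit spreads over both modes and has the larger entropy, while the reverse-KL fit locks onto one mode and has the smaller entropy, which is exactly the picture in the question:

```python
# Sketch: fit a single Gaussian q to a bimodal mixture p by minimizing
# forward KL D(p||q) and reverse KL D(q||p), discretized on a grid.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-12.0, 12.0, 4001)          # evaluation grid
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3, 1) + 0.5 * norm.pdf(x, 3, 1)   # bimodal "truth"

def q_pdf(params):
    mu, log_sigma = params
    return norm.pdf(x, mu, np.exp(log_sigma))

def kl(a, b):
    """Discretized D_KL(a || b) on the grid, in nats."""
    return np.sum(a * (np.log(a + 1e-300) - np.log(b + 1e-300))) * dx

forward = minimize(lambda t: kl(p, q_pdf(t)), x0=[0.0, 0.0])   # min D(p||q)
reverse = minimize(lambda t: kl(q_pdf(t), p), x0=[2.0, 0.0])   # min D(q||p)

def gaussian_entropy(log_sigma):
    """Differential entropy of N(mu, sigma^2): 0.5*log(2*pi*e) + log(sigma)."""
    return 0.5 * np.log(2 * np.pi * np.e) + log_sigma

print("forward KL fit: mu=%.2f sigma=%.2f H(q)=%.2f"
      % (forward.x[0], np.exp(forward.x[1]), gaussian_entropy(forward.x[1])))
print("reverse KL fit: mu=%.2f sigma=%.2f H(q)=%.2f"
      % (reverse.x[0], np.exp(reverse.x[1]), gaussian_entropy(reverse.x[1])))
# The forward-KL fit spreads over both modes (sigma ~ 3.2, higher entropy);
# the reverse-KL fit locks onto one mode (sigma ~ 1, lower entropy).
```

(For a Gaussian family, minimizing the forward KL amounts to moment matching, which is why the fitted $\sigma$ is close to the mixture's overall standard deviation of $\sqrt{10} \approx 3.2$.)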

Another comment: in $\eqref{*}$, if the known distribution $p$ is the empirical data distribution and we find the model $q$ that best approximates it, we effectively get maximum likelihood.
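A minimal sketch of that last point, assuming a Gaussian model $q_\theta$ and simulated data (my own toy example): minimizing the cross-entropy $-\E_p \log q_\theta$ under the empirical $p$ is the same as minimizing the average negative loglikelihood, so the optimum coincides with the closed-form MLE.

```python
# With p the empirical distribution of a sample, minimizing
# D_KL(p || q_theta) over theta is maximum likelihood estimation.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=1.5, scale=2.0, size=500)   # simulated sample

def neg_mean_loglik(theta):
    mu, log_sigma = theta
    # E_p[log p] is a constant in theta, so minimizing the cross-entropy
    # -E_p[log q_theta] (the average negative loglikelihood) is the same
    # as minimizing D_KL(p_emp || q_theta).
    return -np.mean(norm.logpdf(data, mu, np.exp(log_sigma)))

fit = minimize(neg_mean_loglik, x0=[0.0, 0.0])
print("cross-entropy fit :", fit.x[0], np.exp(fit.x[1]))
print("closed-form MLE   :", data.mean(), data.std())   # matches (ddof=0)
```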

answered by kjetil b halvorsen

Firstly, the KL divergence $D_{KL}(p||q)$ is synonymous with relative entropy: the relative entropy of $p$ with respect to $q$.

Entropy then becomes the self‐information of a random variable. Mutual information is a special case of a more general quantity called relative entropy, which is a measure of the distance between two probability distributions.

Source: Entropy, Relative Entropy and Mutual Information

Secondly, the entropy is related to a relative entropy: the relative entropy of $p$ with respect to the uniform distribution.

Intuitively, the entropy of a random variable X with a probability distribution $p(x)$ is related to how much $p(x)$ diverges from the uniform distribution on the support of $X$. The more $p(x)$ diverges the lesser its entropy and vice versa.

\begin{align} H(X) &= \sum_{x\in X}p(x)\log \frac{1}{p(x)} \\ &= \log|X| - \sum_{x\in X}p(x)\log\frac{p(x)}{1/|X|} \\ &= \log|X| - D(p\,||\,\text{uniform}) \end{align}

Source: section 1.2 of COS597D: Information Theory in Computer Science
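A quick numerical check of this identity (NumPy, with an arbitrary discrete $p$ chosen just for illustration):

```python
# Verify H(X) = log|X| - D(p || uniform) for an example discrete p.
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])
u = np.full_like(p, 1.0 / len(p))            # uniform on the same support

H = -np.sum(p * np.log(p))                   # entropy of p, in nats
D = np.sum(p * np.log(p / u))                # D(p || uniform)

print(np.isclose(H, np.log(len(p)) - D))     # True
```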

Then your question (the title) can be rephrased as asking about the relationship between the relative entropy of $p$ with respect to $q$ and the relative entropy of $q$ with respect to the uniform distribution. There is no general relationship between how $p$ diverges from $q$ and how $q$ diverges from the uniform distribution, since $p$ and $q$ can be any pair of distributions.

Shouldn't the entropy of $q$ be greater when minimizing $D_{KL}(q \:||\: p)$ than when minimizing $D_{KL}(p \:||\: q)$ (each w.r.t. parameters of $q$)?

NO.

The two relative entropies just don't correlate with each other. The relative entropy of $q$ with respect to a uniform distribution is not determined by which relative entropy you are minimizing (because $p$ can vary).
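As a small illustration with toy numbers of my own choosing: whether the $q$ that is closer to $p$ in forward KL has higher or lower entropy depends entirely on what $p$ is, so the two quantities carry no fixed relationship.

```python
# Two cases with the same pair of candidate q's but different p's:
# the q closer to p (in D(p||q)) has lower entropy in one case and
# higher entropy in the other.
import numpy as np

def kl(a, b):
    return np.sum(a * np.log(a / b))

def H(a):
    return -np.sum(a * np.log(a))

q1, q2 = np.array([0.9, 0.1]), np.array([0.5, 0.5])

# Case 1: p peaked. q1 is closer to p and has the *lower* entropy.
p = np.array([0.9, 0.1])
print(kl(p, q1) < kl(p, q2), H(q1) < H(q2))   # True True

# Case 2: p uniform. Now q2 is closer to p and has the *higher* entropy.
p = np.array([0.5, 0.5])
print(kl(p, q2) < kl(p, q1), H(q2) > H(q1))   # True True
```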

answered by Lerner Zhang