In Bishop's Pattern Recognition and Machine Learning, there is a short discussion in section 10.1.2 of the difference between minimizing $D_{KL}(p \:||\: q)$ and $D_{KL}(q \:||\: p)$ with respect to the parameters of $q$, where $p$ is a known distribution. Specifically, Bishop states that minimizing $D_{KL}(p \:||\: q)$ results in a broad $q$ that averages across multiple modes of $p$, while minimizing $D_{KL}(q \:||\: p)$ results in a $q$ that is concentrated on a single mode of $p$. Intuitively, I see averaging across multiple modes as producing a $q$ with greater entropy than one concentrated on a single mode. The figure below from the book summarizes this:
However, this seems to contradict the mathematics, since $$ D_{KL}(p \:||\: q) = H(p,q) - H(p) \tag{1}\label{eq:klpq} $$ and $$ D_{KL}(q \:||\: p) = H(q,p) - H(q) \tag{2}\label{eq:klqp} $$ as given here. Therefore, mathematically, minimizing $D_{KL}(q \:||\: p)$ is accomplished in part by encouraging larger entropy of $q$ (via the $-H(q)$ term), whereas minimizing $D_{KL}(p \:||\: q)$ ignores the entropy of $q$ entirely: only the cross-entropy $H(p,q)$ depends on $q$, since $H(p)$ is constant.
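To make the comparison concrete, here is a small numerical sketch (not from the book; the bimodal mixture $p$, the integration grid, and the use of `scipy.optimize` are my own choices) that fits a single Gaussian $q$ to a two-component Gaussian mixture $p$ by minimizing each KL direction and reports the entropy of the fitted $q$ in each case:

```python
# Hypothetical 1-D example: p is a two-component Gaussian mixture, q is a
# single Gaussian. Each KL direction is minimized numerically over
# (mu, log sigma) so the entropies of the two fitted q's can be compared.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Grid for numerical integration
x = np.linspace(-15.0, 15.0, 4001)
dx = x[1] - x[0]

# p: equal-weight mixture of N(-3, 1) and N(+3, 1)
p = 0.5 * norm.pdf(x, -3.0, 1.0) + 0.5 * norm.pdf(x, 3.0, 1.0)

def q_pdf(params):
    mu, log_sigma = params
    return norm.pdf(x, mu, np.exp(log_sigma))

def kl(a, b):
    """Numerical KL(a || b) on the grid, ignoring points where a is ~0."""
    mask = a > 1e-300
    return np.sum(a[mask] * (np.log(a[mask]) - np.log(b[mask] + 1e-300))) * dx

def kl_pq(params):   # D_KL(p || q), the "mode-averaging" direction
    return kl(p, q_pdf(params))

def kl_qp(params):   # D_KL(q || p), the "mode-seeking" direction
    return kl(q_pdf(params), p)

def gaussian_entropy(log_sigma):
    # H(q) = 0.5 * log(2 * pi * e * sigma^2) for a Gaussian q
    return 0.5 * np.log(2.0 * np.pi * np.e) + log_sigma

for name, objective, x0 in [("min D_KL(p||q)", kl_pq, [0.0, 0.0]),
                            ("min D_KL(q||p)", kl_qp, [2.0, 0.0])]:
    res = minimize(objective, x0, method="Nelder-Mead")
    mu, log_sigma = res.x
    print(f"{name}: mu = {mu:.2f}, sigma = {np.exp(log_sigma):.2f}, "
          f"H(q) = {gaussian_entropy(log_sigma):.2f} nats")
```

(The second optimization is started away from $\mu = 0$ so it can settle on one of the two modes rather than the symmetric saddle point.)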
Shouldn't the entropy of $q$ be greater when minimizing $D_{KL}(q \:||\: p)$ than when minimizing $D_{KL}(p \:||\: q)$ (each w.r.t. parameters of $q$)?