I am taking a course on information theory in which we discuss the forward KL divergence (KLD) as a way to approximate pdfs. One of the examples is the same as the one in this blog post:
https://towardsdatascience.com/forward-and-reverse-kl-divergence-906625f1df06
($q_\theta(x)$ is the distribution we are fitting, and $p(x)$ is the true distribution.)
In it, the author writes:
In words, wherever p(x) has high probability, q(x) must also have high probability. This is mean-seeking behaviour, because q(x) must cover all the modes and regions of high probability in p(x), but q(x) is not penalized for having high probability masses where p(x) does not.
I simply do not understand the logic behind this.
Taking into account the presence of the log, we have
$$\arg\max_\theta \, \mathbb{E}_p\!\left[\log q_\theta(x)\right] \;=\; \arg\min_\theta \, H(p, q_\theta) \qquad \text{(the cross-entropy)}.$$
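For reference, here is the expansion I am relying on to relate the forward KL to this cross-entropy (my own rewriting of the standard identity, so please flag it if it is off):
$$
D_{\mathrm{KL}}(p \,\|\, q_\theta) \;=\; \mathbb{E}_p\!\left[\log \frac{p(x)}{q_\theta(x)}\right]
\;=\; \underbrace{-\,\mathbb{E}_p[\log q_\theta(x)]}_{H(p,\,q_\theta)} \;-\; \underbrace{\big(-\mathbb{E}_p[\log p(x)]\big)}_{H(p)},
$$
and since $H(p)$ does not depend on $\theta$,
$$
\arg\min_\theta D_{\mathrm{KL}}(p \,\|\, q_\theta) \;=\; \arg\min_\theta H(p, q_\theta) \;=\; \arg\max_\theta \mathbb{E}_p[\log q_\theta(x)].
$$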
To me, this seems to imply that when $p(x)$ is large, maximising this would call for a small $q_\theta(x)$: the logarithm would then be a large negative number, which, once multiplied by $-1$ in the cross-entropy, gives a large positive number.
I am trying to wrap my head around the intuition behind this optimisation and why, in practice, it leads to mean-seeking behaviour.
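To make the question concrete, here is the kind of numerical check I have in mind (a rough sketch of my own: the bimodal $p$, the grid, and the helper names `gauss` and `cross_entropy` are illustrative choices, not taken from the course or the blog). It fits a single Gaussian $q_\theta = \mathcal{N}(\mu, \sigma^2)$ by brute-force minimisation of the cross-entropy $H(p, q_\theta)$ on a grid, which, if the identity above is right, should be the forward-KL optimum:

```python
import numpy as np

# Grid and a bimodal "true" distribution p: an equal mixture of two Gaussians.
# The mode locations (-3 and +3) are my own illustrative choice.
x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated on the grid."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3.0, 1.0) + 0.5 * gauss(x, 3.0, 1.0)

def cross_entropy(mu, sigma):
    """H(p, q_theta) = -E_p[log q_theta(x)], approximated by a Riemann sum."""
    q = gauss(x, mu, sigma)
    return -np.sum(p * np.log(q + 1e-300)) * dx

# Brute-force search over (mu, sigma) for the cross-entropy / forward-KL optimum.
mus = np.linspace(-5.0, 5.0, 101)
sigmas = np.linspace(0.5, 5.0, 91)
mu_star, sigma_star = min(
    ((m, s) for m in mus for s in sigmas),
    key=lambda ms: cross_entropy(*ms),
)
print(f"forward-KL optimum: mu = {mu_star:.2f}, sigma = {sigma_star:.2f}")
# My expectation: mu close to 0 (between the two modes) and a wide sigma,
# i.e. a single broad Gaussian covering both modes.
```

If the optimum really does land at $\mu \approx 0$ with a wide $\sigma$, that is exactly the "cover all the modes" behaviour described in the quote, and that is the step I would like to understand directly from the formula.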