
I was trying to understand cross-entropy and ended up learning about KL divergence. I learnt that cross-entropy is entropy plus KL divergence:

$$ H(P, Q) = H(P) + D_{KL}(P || Q) $$

Minimizing cross-entropy therefore means minimizing KL divergence. I further read that minimizing KL divergence means we are trying to make Q close to P. But I really wanted to know why this happens. I read from many sources that when Q is close to P, D_KL is close to zero, but I didn't find any proper justification for this. I wonder if somebody has better insights on this.
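For concreteness, here is a small numeric check of this decomposition (a sketch using numpy; the two example distributions are made up purely for illustration):

```python
import numpy as np

# Two made-up discrete distributions over the same four outcomes
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.25, 0.25, 0.25, 0.25])

entropy_p     = -np.sum(p * np.log(p))      # H(P)
cross_entropy = -np.sum(p * np.log(q))      # H(P, Q)
kl_pq         =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

# The decomposition H(P, Q) = H(P) + D_KL(P || Q) holds up to float error:
print(cross_entropy)      # ~1.3863
print(entropy_p + kl_pq)  # ~1.3863
```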

  • KL divergence has a relationship to a distance: if P and Q are close, the distance between them approaches zero. Some useful answers here, relating KL to a metric: https://stats.stackexchange.com/q/1031 – msuzen Nov 08 '21 at 17:40
  • What is it that you don't understand? If P and Q are identical, Q doesn't diverge at all from P and the KL-divergence is accordingly zero. That's how it's designed. Or is it the math that you don't understand? Have you looked up the definition of KL-divergence? – Igor F. Nov 08 '21 at 17:43
  • @IgorF, yeah, I understand the KL divergence will be zero when Q ~ P, but I wanted to know what exactly happens as Q approaches P, since I have a feeling that the KL divergence also gets smaller and finally becomes zero when Q = P. – Nisan Chhetri Nov 08 '21 at 17:48

1 Answer


For discrete probability distributions $P$ and $Q$, the KL-divergence is defined as

$$ D_{KL}(P || Q) = \sum_x P(x) \ln\frac{P(x)}{Q(x)} $$

So, as $Q \rightarrow P$, the ratio $P(x)/Q(x)$ approaches $1$ for every $x$, and the logarithm $\ln\frac{P(x)}{Q(x)}$ approaches zero. Since probabilities are bounded to the range $[0, 1]$, each term $P(x) \ln\frac{P(x)}{Q(x)}$ in the sum also approaches zero, and consequently the whole sum approaches zero as well.
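To see this numerically, one can slide $Q$ toward $P$ along the straight-line path $Q_t = (1-t)\,Q + t\,P$ and watch the divergence shrink to zero (a minimal sketch in Python; the example distributions and the mixture path are arbitrary illustrative choices):

```python
import numpy as np

def kl(p, q):
    """D_KL(P || Q) for discrete distributions given as probability vectors."""
    return np.sum(p * np.log(p / q))

# Arbitrary example distributions, chosen only for illustration
p = np.array([0.1, 0.2, 0.3, 0.4])
q = np.array([0.4, 0.3, 0.2, 0.1])

# Slide Q toward P along the mixture path Q_t = (1 - t) Q + t P
for t in [0.0, 0.25, 0.5, 0.75, 0.99, 1.0]:
    q_t = (1 - t) * q + t * p
    print(f"t = {t:4.2f}   D_KL(P || Q_t) = {kl(p, q_t):.6f}")
# The printed divergence decreases toward 0 as t -> 1, i.e. as Q_t -> P
```

Along this particular path the divergence is in fact non-increasing, because $D_{KL}(P || Q)$ is convex in $Q$ and attains its minimum, zero, at $Q = P$.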

So much for the mathematical formalism. For some intuition behind it, you may consult my answer here.

Igor F.
  • I completely agree with you, and I am also aware that the ratio of P and Q approaches one as these distributions get close to each other (intuitively as well). However, I am looking for a more mathematical proof of what exactly happens, or of how the value of D_KL changes as Q approaches P. I would also like to understand this for all possible distributions P and Q. Is there any such analysis or evaluation? – Nisan Chhetri Nov 10 '21 at 22:26