I was trying to understand cross-entropy and ended up studying KL divergence. I learned that cross-entropy is entropy plus KL divergence:
H(P, Q) = H(P) + D_KL(P||Q)
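For concreteness, here is a quick numerical check of this identity (just a sketch in Python; the distributions p and q are arbitrary made-up examples):

```python
import numpy as np

# Two arbitrary discrete distributions over 3 outcomes (made-up example values)
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

entropy_p = -np.sum(p * np.log(p))       # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q)
kl_pq = np.sum(p * np.log(p / q))        # D_KL(P||Q)

print(cross_entropy, entropy_p + kl_pq)  # the two printed numbers agree
```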
Minimizing cross-entropy with respect to Q therefore means minimizing KL divergence, since H(P) does not depend on Q. I further read that minimizing KL divergence means we are trying to make Q close to P. But why does this happen? Many sources say that when Q is close to P, D_KL(P||Q) is close to zero, but I didn't find any proper justification for this. I wonder if somebody has better insights on this.
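To illustrate what I mean by "Q close to P makes D_KL close to zero", here is a small sketch (again with made-up distributions) where Q is moved toward P along a straight line and D_KL(P||Q) shrinks to zero:

```python
import numpy as np

def kl(p, q):
    """D_KL(P||Q) for discrete distributions (assumes q > 0 wherever p > 0)."""
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.2, 0.1])    # target distribution (made-up example)
q0 = np.array([1/3, 1/3, 1/3])   # starting guess: uniform

# Mix Q toward P and watch D_KL(P||Q) shrink
for t in [0.0, 0.25, 0.5, 0.75, 1.0]:
    q = (1 - t) * q0 + t * p     # convex mix, still a valid distribution
    print(f"t={t:.2f}  D_KL={kl(p, q):.4f}")
# D_KL(P||Q) shrinks as Q moves toward P and is exactly 0 when Q == P
```

Running this shows the divergence dropping to 0 at t = 1, but I am looking for the underlying justification rather than just a numerical observation.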