
The forward and reverse formulations of KL divergence are usually distinguished by their mean-seeking versus mode-seeking behavior. The typical example for using KL to optimize a distribution $Q_\theta$ to fit a distribution $P$ (e.g. see this blog) is a bimodal true distribution $P$ and a unimodal Gaussian $Q_\theta$. In this case the forward divergence $D_{KL}(P \parallel Q_\theta)$ widens $Q_\theta$ so that its mode falls between the two modes of $P$ but it covers all points where $P > 0$, while the reverse divergence $D_{KL}(Q_\theta \parallel P)$ forces $Q_\theta$ onto one of the modes of $P$, leaving the other mode largely uncovered.

My question is: if it is known that $P$ is a unimodal Gaussian, is there any difference between the forward and reverse KL divergence? And if not, does that mean the KL divergence is a proper distance metric for Gaussians (i.e. also symmetric)? Doing some computations with Gaussians, I find there still seem to be differences between the forward and reverse KL, but I don't understand intuitively why that should be if the true distribution $P$ has a single mode.
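For reference, here is a minimal sketch of the kind of computation I mean, using the standard closed-form expression for the KL divergence between two univariate Gaussians (the parameter values are arbitrary examples):

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ) for univariate Gaussians."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Two unimodal Gaussians: P = N(0, 1) and Q = N(1, 2^2)
mu_p, sigma_p = 0.0, 1.0
mu_q, sigma_q = 1.0, 2.0

forward = kl_gauss(mu_p, sigma_p, mu_q, sigma_q)   # D_KL(P || Q) ~ 0.443
reverse = kl_gauss(mu_q, sigma_q, mu_p, sigma_p)   # D_KL(Q || P) ~ 1.307
print(forward, reverse)  # the two directions give different values
```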

adamconkey

1 Answer


In the case that both $P(x)$ and $Q_\theta(x)$ are unimodal Gaussian distributions, and we are estimating all the parameters (so $\theta$ represents both $\mu$ and $\Sigma$), the global minimum of $D_{KL}(P\parallel Q_\theta)$ and of $D_{KL}(Q_\theta\parallel P)$ is reached at the same value of $\theta$: the one that matches the mean and covariance matrix of $P(x)$. Here is why:

  • In the case where we assume that the true probability distribution is given by $P(x)$ (with $x$ as the random variable), the KL divergence is: $$D_{KL}(P\parallel Q_\theta)=\mathbb{E}_{x\sim P}[\log P(x) - \log Q_\theta(x)]$$ Since $\mathbb{E}_{x\sim P}[\log P(x)]$ does not depend on $\theta$, the value of $\theta$ that minimizes this KL divergence is given by: $$ \theta = \arg\min_\theta\mathbb{E}_{x\sim P}[- \log Q_\theta(x)]$$ The term $\mathbb{E}_{x\sim P}[- \log Q_\theta(x)]$ is the cross entropy of $Q_\theta(x)$ w.r.t. $P(x)$, which reaches its minimum when $Q_\theta(x) = P(x)$; since both distributions are Gaussian, $Q_\theta$ can match $P$ exactly.

  • On the other hand, if we assume that the true probability distribution is given by $Q_\theta(x)$, then the KL divergence between the probability distributions $Q_\theta(x)$ and $P(x)$ is given by: $$D_{KL}(Q_\theta\parallel P)=\mathbb{E}_{x\sim Q_\theta}[\log Q_\theta(x)-\log P(x)]$$ Again, because both $P(x)$ and $Q_\theta(x)$ are unimodal Gaussian distributions, the minimum is reached when $Q_\theta(x) = P(x)$, since $P(x)$ can be represented exactly by $Q_\theta(x)$ if we set proper values for $\theta$ (a short numerical check follows this list).
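Here is a minimal numerical sketch of this claim, assuming a one-dimensional Gaussian $P$ and the closed-form KL divergence between univariate Gaussians; `scipy.optimize.minimize` is just one convenient way to carry out the minimization:

```python
import numpy as np
from scipy.optimize import minimize

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) )."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Fixed "true" distribution P (example values)
mu_p, sigma_p = 2.0, 1.5

# Parametrize Q_theta by theta = (mu_q, log sigma_q) so sigma_q stays positive
def forward_kl(theta):   # D_KL(P || Q_theta)
    return kl_gauss(mu_p, sigma_p, theta[0], np.exp(theta[1]))

def reverse_kl(theta):   # D_KL(Q_theta || P)
    return kl_gauss(theta[0], np.exp(theta[1]), mu_p, sigma_p)

theta0 = np.zeros(2)
fwd = minimize(forward_kl, theta0)
rev = minimize(reverse_kl, theta0)

# Both minimizers recover (mu_p, sigma_p), and both minima are ~0
print(fwd.x[0], np.exp(fwd.x[1]), fwd.fun)   # ~2.0, ~1.5, ~0
print(rev.x[0], np.exp(rev.x[1]), rev.fun)   # ~2.0, ~1.5, ~0
```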

In both cases the minimum KL divergence value is zero. But this does not mean that $D_{KL}(Q_\theta\parallel P)=D_{KL}(P\parallel Q_\theta)$ for every value of $\theta$, as we saw above: in the first case the expectation is computed with respect to $P(x)$, whereas in the second case it is computed with respect to $Q_\theta(x)$.

This means that $[\log P(x) - \log Q_\theta(x)]$ and $[\log Q_\theta(x) -\log P(x)]$ are weighted in different ways in each case. So unless $P(x)=Q_\theta(x)$, the two KL divergences will generally differ.
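As a concrete illustration, the closed form for univariate Gaussians makes this asymmetry explicit: $$D_{KL}\big(\mathcal{N}(\mu_1,\sigma_1^2)\parallel\mathcal{N}(\mu_2,\sigma_2^2)\big)=\log\frac{\sigma_2}{\sigma_1}+\frac{\sigma_1^2+(\mu_1-\mu_2)^2}{2\sigma_2^2}-\frac{1}{2}$$ Swapping the two distributions exchanges the roles of $(\mu_1,\sigma_1)$ and $(\mu_2,\sigma_2)$, which in general gives a different value unless the two parameter sets coincide.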

So in this sense, due to the fact that $D_{KL}(Q_\theta\parallel P)\neq D_{KL}(P\parallel Q_\theta)$ in general, we can't say that the KL divergence is a proper distance metric.

Javier TG
  • Can you say what happens if the family $Q_\theta$ does not contain normal distributions? (so the minimum value is larger than zero) – kjetil b halvorsen Sep 24 '20 at 20:52
  • If we considered that $Q_\theta$ does not contain normal distributions then the value of $\theta$ that minimizes $D_{KL}(P \parallel Q_\theta)$ would in general be different from the one that minimizes $D_{KL}(Q_\theta \parallel P)$, because the KL divergence is asymmetric. – Javier TG Sep 24 '20 at 21:03