I recently stumbled into a similar question.
To answer why an asymmetric divergence can be more favourable than a symmetric one, consider a scenario where you want to quantify the quality of a proposal distribution used in importance sampling (IS). If you are unfamiliar with IS, the key fact needed here is that an efficient IS scheme requires a proposal distribution with heavier tails than the target distribution; otherwise the importance weights can have very large (or even infinite) variance.
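To see why tail behaviour matters, here is a minimal sketch (the target, the two proposals, and the sample size are chosen purely for illustration) that estimates $E[X^2]$ under a wide normal target with a light-tailed and a heavy-tailed proposal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100_000

# Target: Normal(0, 25), i.e. standard deviation 5, so E[X^2] = 25.
target = stats.norm(0, 5)

for label, proposal in [("light-tailed N(0, 1)  ", stats.norm(0, 1)),
                        ("heavy-tailed N(0, 100)", stats.norm(0, 10))]:
    x = proposal.rvs(size=n, random_state=rng)
    # Importance weights: target density over proposal density.
    w = np.exp(target.logpdf(x) - proposal.logpdf(x))
    estimate = np.mean(w * x**2)        # IS estimate of E[X^2]
    ess = w.sum()**2 / np.sum(w**2)     # effective sample size
    print(f"{label}: estimate = {estimate:6.2f}, ESS = {ess:9.0f} / {n}")
```

With the light-tailed proposal the weight function is $\exp(0.48x^2)/5$, which is unbounded, so the estimator has infinite variance and its value is unreliable; the heavy-tailed proposal gives bounded weights and behaves well.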
Denote two distributions $H=\text{Normal}(0, 25)$ and $L=\text{Normal}(0, 1)$, where the second parameter is the variance, so $H$ is much wider than $L$. Suppose you target $H$ with IS, using $L$ as the proposal distribution. To quantify the quality of your proposal, you might compute the Jensen-Shannon (JS) divergence $\text{JS}(L, H)$ and the Kullback-Leibler (KL) divergence of the target from the proposal, $\text{KL}(H || L)$, and obtain some values. Both should give you some sense of how good your proposal distribution $L$ is. Nothing to see here yet. However, consider reversing the setup, i.e., target $L$ with IS using $H$ as the proposal distribution. The JS divergence stays the same because it is symmetric, while the corresponding KL divergence, $\text{KL}(L || H)$, is much lower. In short, we expect using $H$ to target $L$ to be fine, but using $L$ to target $H$ to be problematic. The KL divergence aligns with this expectation, $\text{KL}(H || L) > \text{KL}(L || H)$; the JS divergence does not.
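To put numbers on this, here is a minimal sketch, again assuming the second parameter above is the variance (so $H$ has standard deviation 5) and an integration range of $\pm 30$ chosen to cover both distributions; the KL between two univariate normals has a closed form, while the JS divergence is computed by numerical integration:

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

H = stats.norm(0, 5)   # Normal(0, 25): wide
L = stats.norm(0, 1)   # Normal(0, 1):  narrow

def kl_normal(p, q):
    """Closed-form KL(p || q) for two univariate normals, in nats."""
    mp, sp = p.mean(), p.std()
    mq, sq = q.mean(), q.std()
    return np.log(sq / sp) + (sp**2 + (mp - mq)**2) / (2 * sq**2) - 0.5

def js(p, q):
    """JS divergence via numerical integration (no closed form in general)."""
    m = lambda x: 0.5 * (p.pdf(x) + q.pdf(x))
    kl_pm = quad(lambda x: p.pdf(x) * np.log(p.pdf(x) / m(x)), -30, 30)[0]
    kl_qm = quad(lambda x: q.pdf(x) * np.log(q.pdf(x) / m(x)), -30, 30)[0]
    return 0.5 * (kl_pm + kl_qm)

print(f"KL(H || L) = {kl_normal(H, L):.2f} nats")    # ~10.4: L is a poor proposal for H
print(f"KL(L || H) = {kl_normal(L, H):.2f} nats")    # ~1.1:  H is a decent proposal for L
print(f"JS(H, L) = JS(L, H) = {js(H, L):.2f} nats")  # symmetric: same either way
```

The two KL directions differ by an order of magnitude, which is exactly the directional information the single symmetric JS value throws away.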
This asymmetry aligns with our goal: loosely speaking, the KL divergence correctly accounts for the direction of the discrepancy between the two distributions.
Another factor to consider is that the JS divergence can be significantly more computationally challenging than the KL divergence: it involves the mixture $(H + L)/2$, so it rarely has a closed form, whereas KL between two Gaussians does (as in the sketch above).