
The Kullback-Leibler (KL) divergence between two distributions $P$ and $Q$ is defined as $$\mbox{KL}(P \| Q) = \mathbb{E}_P\left[\ln \frac{\mbox{d}P}{\mbox{d}Q}\right].$$
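For discrete distributions with probability vectors $p$ and $q$, this reduces to $\mbox{KL}(P \| Q) = \sum_i p_i \ln \frac{p_i}{q_i}$, which is the form used in the answer below.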

My question: given three distributions $P, Q, R$, is it possible to give an upper bound on the difference of KL divergences? More precisely, $$\mbox{KL}(P\|R) - \mbox{KL}(Q\|R) \leq~?$$ The right-hand side should capture some measure of the difference between $P$ and $Q$ and be independent of $R$. [I guess some additional boundedness assumptions are necessary.]

A desired result for the squared norm is as follows. For any three vectors $x,y,z\in \mathbb{R}^d$, we have $$\|x - z\|^2 - \|y-z\|^2 = \langle x - y, x+y -2z\rangle \leq 4D\|x - y\|,$$ where $D$ is the maximal norm of the vectors, namely, $\|x\|, \|y\|, \|z\| \leq D$.
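One way to see this: with $a = x - z$ and $b = y - z$ the equality is just $\|a\|^2 - \|b\|^2 = \langle a - b,\, a + b\rangle$, and the inequality then follows from Cauchy–Schwarz together with $\|x + y - 2z\| \le \|x\| + \|y\| + 2\|z\| \le 4D$.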

Is it possible to achieve a similar result for general distributions with KL-divergence? Thanks!

Matics

1 Answer


I don't think there is an upper bound that doesn't involve having constraints on $R$.

In order to see this, consider the special case $Q=R$, which means $\mbox{KL}(Q\|R)=0$. You would then need a finite upper bound on $\mbox{KL}(P\|R)$ that holds for every possible $R$, and no such bound exists, because the KL divergence approaches infinity when one of the probabilities in $R$ approaches $0$ while the corresponding probability in $P$ stays positive.
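A quick numerical illustration of this blow-up (a minimal sketch in Python; the two-point distributions and the small `kl` helper are only for illustration):

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) in nats; terms with p_i = 0 contribute 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.5, 0.5])
for eps in [1e-1, 1e-3, 1e-6, 1e-9]:
    r = np.array([1 - eps, eps])  # one probability of R shrinks towards 0
    print(f"eps={eps:.0e}  KL(P||R) = {kl(p, r):.2f}")
# The printed values grow without bound as eps -> 0, so with Q = R the
# difference KL(P||R) - KL(Q||R) = KL(P||R) cannot be bounded independently of R.
```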

One obvious way to constrain $R$ is to require that every probability is bounded below by some $\epsilon > 0$, i.e. $R(x) \ge \epsilon$ for every possible $x$. This restriction limits the distribution families that you are allowed to use, because the values must live on a bounded domain (for example, $R$ cannot be a Gaussian distribution). With this assumption we can find an upper bound for discrete distributions (the same could be done for continuous distributions as well):

$$ \begin{align} \mbox{KL}(P\|R) - \mbox{KL}(Q\|R) &= H(Q) - H(P) - \sum_{i=1}^{N}(p_i - q_i) \log r_i \\ &\le H(Q) - H(P) - \sum_{i=1}^{N}|p_i - q_i| \log r_i \\ &\le H(Q) - H(P) - \log \epsilon \sum_{i=1}^{N}|p_i - q_i| \end{align} $$

where $H(P)$ is the entropy of $P$ and $N$ is the number of categories in the distribution.

In addition, it might be important to note that $\epsilon \le \frac{1}{N}$, with equality only for the discrete uniform distribution; for larger values of $\epsilon$ the probabilities would sum to more than $1$.
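For what it's worth, here is a minimal Monte Carlo sanity check of the bound above (a sketch in Python; the Dirichlet sampling and the mixing trick used to enforce $r_i \ge \epsilon$ are just illustrative choices, not part of the argument):

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def entropy(p):
    return float(-np.sum(p * np.log(p)))

N, eps = 5, 0.05  # the floor is feasible only if eps <= 1/N
for _ in range(1000):
    p = rng.dirichlet(np.ones(N))
    q = rng.dirichlet(np.ones(N))
    # draw R with every r_i >= eps by mixing a Dirichlet sample with the floor eps
    r = eps + (1 - N * eps) * rng.dirichlet(np.ones(N))
    lhs = kl(p, r) - kl(q, r)
    rhs = entropy(q) - entropy(p) - np.log(eps) * np.sum(np.abs(p - q))
    assert lhs <= rhs + 1e-12  # the bound should never be violated
print("bound held in all trials")
```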

UPDATE:

In general, it looks to me like any constraint (or set of constraints) has to be quite restrictive. The difference can be written in the following way

$$ \begin{align} \mbox{KL}(P\|R) - \mbox{KL}(Q\|R) &= H(Q) - H(P) - \sum_{i=1}^{N}(p_i - q_i) \log r_i \end{align} $$

$H(P)$ and $H(Q)$ have finite upper and lower bounds, which means the difference can approach infinity only when the last sum $\sum_{i=1}^{N}(p_i - q_i) \log r_i$ approaches negative infinity. The $\log r_i$ term is the only part of this sum that can push everything to infinity. There are two ways to prevent this from happening. Either the $\log r_i$ term is controlled directly, meaning that we make sure $r_i$ never approaches $0$ (exactly what I did above), or we control $\log r_i$ through the $(p_i - q_i)$ factor. One way to do that is to make sure that whenever $r_i$ approaches $0$, the difference $(p_i - q_i)$ also approaches $0$. This observation creates a dependency between all three distributions: basically, whenever $r_i$ approaches $0$, $p_i$ must approach $q_i$. Another way is to make sure that terms approaching $+\infty$ and $-\infty$ cancel each other out (e.g. $p_i - q_i = a$ and $p_j - q_j = -a$, at least for the indices where $r_i$ and $r_j$ approach $0$).
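A tiny numerical illustration of the dependency option (a Python sketch with hypothetical numbers; the coordinate of $R$ that shrinks to $0$ is exactly the one where $p_i = q_i$):

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

# Hypothetical 3-category example: r_1 -> 0, but p_1 = q_1 on that coordinate.
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.2, 0.3, 0.5])
for eps in [1e-2, 1e-4, 1e-8]:
    r = np.array([eps, 0.6 - eps, 0.4])
    print(f"eps={eps:.0e}  KL(P||R) - KL(Q||R) = {kl(p, r) - kl(q, r):.4f}")
# The difference stays bounded as eps -> 0 because the dangerous log(r_1) term
# is multiplied by (p_1 - q_1) = 0; with p_1 != q_1 it would diverge.
```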

itdxer
  • The desired upper bound is allowed to be a function of $P$ and $Q$, but independent of $R$. For the special case of $Q=R$ as mentioned, $\mbox{KL}(P\| R) - \mbox{KL}(Q\| R) = \mbox{KL}(P\| Q) - \mbox{KL}(Q\| Q) = \mbox{KL}(P\| Q)$, which is a desired bound (it measures the difference between $P$ and $Q$ without involving $R$)! – Matics Apr 07 '20 at 12:05
  • Maybe I'm missing something, but would you say that the upper bound in my answer depends on $R$? It's controlled by the $\epsilon$ parameter, which is just part of the constraint, and it works for any distribution $R$ that satisfies that constraint. Or do you mean that the constraint on the distribution $R$ should depend on the $P$ and $Q$ distributions? – itdxer Apr 07 '20 at 12:37
  • 1
    Of course, the bound you provided makes sense, thanks! But it is not exactly what I am pursuing. Two reasons: (1) the uniform lower bound for distribution $R$ would be too restricted, which might result in a very loose upper bound (for example, when $\epsilon$ is very small); (2) the upper bound does not reflect the difference of $P$ and $Q$. Probably my requirement description is kind of vague, I guess the example of square functions in the question could better illustrate the requirements (upper bounded by a distance between $P$ and $Q$, with some **general** bounded assumptions on $P,Q,R$). – Matics Apr 07 '20 at 12:46
  • I don't think you can do it without making the distributions either too restricted or dependent. I added a few more observations to my answer. – itdxer Apr 07 '20 at 13:40