I also bumped into this passage recently! I'm not very familiar with probability/information theory, so I hope this makes sense and my notation is understandable; I tried for precision at the expense of brevity, but there's some notation in the book that I just don't know how to use precisely.
As far as I can tell, the "data-processing inequality" for KL divergence (aka relative entropy) is proved "in the same way as the data-processing inequality for mutual information" in the sense that both proofs expand a certain quantity in two ways using a chain rule and then bound parts of the expansion. This is despite the fact that the chain rule for mutual information (Theorem 2.5.2) and the chain rule for relative entropy (Theorem 2.5.3) don't seem analogous to me, except in some intuitive sense.
In its most general form, the data-processing inequality for relative entropy is probably something like this:
Theorem. Let $X_1$ and $X_2$ be two random variables with the same set of possible values $\mathcal{X}$ and probability distributions $P_1$ and $P_2$. Let $Y_1$ and $Y_2$ be two random variables with the same set of possible values $\mathcal{Y}$ that depend on $X_1$ and $X_2$, respectively, in the same way; that is, $Y_1$ is drawn from a distribution depending only on $X_1$, $Y_2$ is drawn from a distribution depending only on $X_2$, and for every $x \in \mathcal{X}$ the conditional distribution of $Y_1$ given $X_1 = x$ is the same as the conditional distribution of $Y_2$ given $X_2 = x$. If the probability distributions of $Y_1$ and $Y_2$ are $\hat{P_1}$ and $\hat{P_2}$, then
$$
D(P_1\| P_2) \geq D(\hat{P_1}\| \hat{P_2}).
$$
Note: We can in particular pick a function $f : \mathcal{X} \to \mathcal{Y}$ and just take $Y_1$ and $Y_2$ to be $f(X_1)$ and $f(X_2)$. This is how this inequality is applied in the proof of Theorem 11.6; $f$ is taken to be the indicator function of the set
$$
A := \{x \in \mathcal{X} : P_1(x) > P_2(x)\},
$$
which reduces the theorem for general random variables to the case of binary random variables.
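Concretely, for this choice of $f$ the distributions $\hat{P_1}$ and $\hat{P_2}$ are Bernoulli, with $\hat{P_i}(1) = P_i(A)$ and $\hat{P_i}(0) = 1 - P_i(A)$, so the inequality specializes to
$$
D(P_1\| P_2) \geq P_1(A) \log \frac{P_1(A)}{P_2(A)} + (1 - P_1(A)) \log \frac{1 - P_1(A)}{1 - P_2(A)},
$$
which, if I'm reading it right, is the binary bound that the proof of Theorem 11.6 needs.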
Proof: By applying the chain rule for relative entropy (Theorem 2.5.3), we can expand the KL divergence $D(P_1, \hat{P_1}\| P_2, \hat{P_2})$ between the joint distribution of $(X_1, Y_1)$ and the joint distribution of $(X_2, Y_2)$ (both over the values $\mathcal{X} \times \mathcal{Y}$) in two ways:
$$
\begin{align*}
D(P_1, \hat{P_1}\| P_2, \hat{P_2})
&= D(\hat{P_1}\|\hat{P_2}) + D(P_1(X_1|Y_1)\| P_2(X_2|Y_2))\\
&= D(P_1\| P_2) + D(P_1(Y_1|X_1)\| P_2(Y_2|X_2)).
\end{align*}
$$
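For reference, the chain rule (Theorem 2.5.3) says that for joint distributions $p(x,y)$ and $q(x,y)$,
$$
D(p(x,y)\| q(x,y)) = D(p(x)\| q(x)) + D(p(y|x)\| q(y|x));
$$
the second line above applies this as stated (factoring each joint as the marginal of $X$ times the conditional of $Y$ given $X$), and the first line applies it with the roles of $X$ and $Y$ swapped (factoring each joint as the marginal of $Y$ times the conditional of $X$ given $Y$).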
The last term in each line is a conditional relative entropy, defined in (2.66) as
$$
D(p(y|x)\| q(y|x)) := \mathbb{E}_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}.
$$
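Unpacking the expectation as a double sum over the (discrete) alphabets (this is just rewriting the definition, nothing new):
$$
D(p(y|x)\| q(y|x)) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(y|x)}{q(y|x)} = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)}.
$$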
Since the conditional distribution of $Y_1$ given $X_1$ is identical to the conditional distribution of $Y_2$ given $X_2$, the conditional relative entropy between them is 0 (equation 2.91), i.e.,
$$
D(P_1(Y_1|X_1)\|P_2(Y_2|X_2)) = 0.
$$
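Spelled out, using the assumption that $P_1(y|x) = P_2(y|x)$ wherever $P_1(x,y) > 0$:
$$
D(P_1(Y_1|X_1)\| P_2(Y_2|X_2)) = \mathbb{E}_{P_1(x,y)} \log \frac{P_1(y|x)}{P_2(y|x)} = \mathbb{E}_{P_1(x,y)} \log 1 = 0.
$$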
Since conditional relative entropies are nonnegative (equation 2.91),
$$
D(P_1(X_1|Y_1)\| P_2(X_2|Y_2)) \geq 0.
$$
Therefore,
$$
\begin{align*}
D(P_1\| P_2)
&= D(\hat{P_1}\|\hat{P_2}) + D(P_1(X_1|Y_1)\| P_2(X_2|Y_2))
\\ &\geq D(\hat{P_1}\|\hat{P_2}),
\end{align*}
$$
as desired.
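As a quick sanity check (my own code, not from the book), here is a small numerical experiment: draw random distributions $P_1, P_2$ on a 5-letter alphabet, push both through one random channel $W(y|x)$ to get $\hat{P_1}, \hat{P_2}$, and verify that $D(P_1\|P_2) \geq D(\hat{P_1}\|\hat{P_2})$ in every trial.
```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

nx, ny = 5, 3
for _ in range(1000):
    # Two input distributions P1, P2 on the same alphabet of size nx.
    p1 = rng.dirichlet(np.ones(nx))
    p2 = rng.dirichlet(np.ones(nx))
    # One channel applied to both: row x of w is the distribution of Y given X = x.
    w = rng.dirichlet(np.ones(ny), size=nx)
    # Output (pushed-forward) distributions of Y1 and Y2.
    p1_hat = p1 @ w
    p2_hat = p2 @ w
    assert kl(p1, p2) >= kl(p1_hat, p2_hat) - 1e-12
print("D(P1 || P2) >= D(P1_hat || P2_hat) held in all 1000 trials")
```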
Note 2: These lecture notes seem to have a proof of a more general version of this theorem, which also applies to more general divergences, although I can't say I understand it.