
I have read what the KL divergence is about: it assesses the difference between two probability distributions.

I have also read, and digested, that it is emphatically not a true metric because of its asymmetry.

Now I have been wanting to ask: I cannot find a simple textbook example of the KL divergence. Can someone show me how it is used and demonstrated, just to convey its main intuitive insights? For example, with two discrete distributions $P$ and $Q$, or with very simple distributions like the uniform or the Gaussian.

I would appreciate your contributions.

cgo
  • Information gain is used in decision trees. It uses the KL divergence to compare the class distribution of the child nodes. – Simone Nov 29 '16 at 15:02
  • The Kullback-Leibler divergence is also called the Kullback-Leibler distance since it is a form of distance between probability distributions. You can find it in advanced statistics books and in books on information theory. I will get you a list and put it in an answer. – Michael R. Chernick Nov 29 '16 at 15:07
  • 1
    Related: http://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758 – kjetil b halvorsen Nov 29 '16 at 17:14
  • There are plenty of examples on our site. The first one I found in a search is at http://stats.stackexchange.com/questions/60680. – whuber Nov 29 '16 at 19:17
  • I took a look at all the links. Although there is a lot of information on KL divergence, including Wikipedia articles, blogs and journal articles, I did not see any textbooks listed. Since there are several mathematical statistics books that deal with the Bhattacharyya distance and the KL divergence, I am voting to keep this question open. – Michael R. Chernick Jan 20 '17 at 20:24
  • 2
    @Michael, "Textbook example" means "the kind of example that could appear in an (elementary) textbook," not "an example that has appeared in a textbook." – whuber Jan 20 '17 at 22:17
  • @whuber That doesn't jibe with many of the threads on this site where the OP asks for book recommendations and several answers are given. – Michael R. Chernick Jan 20 '17 at 22:30
  • Searching for book recommendations I found 131 threads, some for advanced books and others for elementary books. – Michael R. Chernick Jan 20 '17 at 22:34
  • 2
    @Michael I'm talking about common usage. Finding threads that ask for textbooks is irrelevant. Search our site for ["textbook example"](http://stats.stackexchange.com/search?q=%22textbook+example%22) to see what I mean. Please note that the [tag:references] tag was *not* applied by the OP: it was applied by another user. I have rolled the question back to its original form to make this clear. – whuber Jan 21 '17 at 00:18
  • @whuber You can say what you want, and whether it is true or not, the OP said the following: "I cannot find a simple textbook example of KL Divergence". Why would you edit out a tag regardless of who placed it? – Michael R. Chernick Jan 21 '17 at 00:35
  • 3
    @Michael I didn't edit anything--I rolled the text back to the original, because including that tag changed the meaning of the question, profoundly. If this were (merely) a question looking for textbooks, it would be made Community Wiki and it would need very different answers than the nice one from David Kozak. Of course anybody can say what they want--but what matters is *evidence*, which I provided. You are welcome to ignore the evidence, but as a result it's unlikely I (or anyone else) would be interested in continuing to engage you in any conversation about the subject. – whuber Jan 21 '17 at 00:38

2 Answers


An enlightening example is its use in Stochastic Neighbor Embedding (SNE), devised by Hinton and Roweis.

Essentially, the authors are trying to represent high-dimensional data in two or three dimensions so that it can be visualized (similar in aim to PCA, for instance). The difference is that rather than preserving total variance in the data (as PCA does), SNE attempts to preserve the local structure of the data -- if that is unclear, the KL divergence may help to illuminate what it means. To do this, they use a Gaussian kernel to estimate the probability that point $i$ would pick point $j$ as its neighbour:

$$p_{i,j} = \frac{\exp(-\|x_i-x_j\|^2\;/\; 2\sigma_i^2)}{\sum_{k\neq i} \exp(-\| x_i - x_k \|^2\;/\;2\sigma_i^2)},$$

where $P_i$ denotes the resulting distribution $(p_{i,j})_j$ over the neighbours of point $i$.

They then use a Gaussian kernel (this time with a fixed variance) to define the corresponding probabilities for the mapped points $y_i$ in the low-dimensional space:

$$q_{i,j} = \frac{\exp(-\|y_i-y_j\|^2)}{\sum_{k\neq i} \exp(-\| y_i - y_k \|^2)},$$

with $Q_i$ the corresponding distribution $(q_{i,j})_j$ in the low-dimensional space,

and they use the cost function $C=\sum_i D_{KL}(P_i\,\|\,Q_i)$ to measure how well the low-dimensional map represents the original data.
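For concreteness, here is a minimal NumPy sketch of these quantities and of the cost $C$. It is not the authors' implementation: the helper names (`neighbour_probs`, `sne_cost`) are made up, and it uses a single fixed bandwidth $\sigma$ for all points, whereas SNE chooses each $\sigma_i$ by a perplexity search.

```python
import numpy as np

def neighbour_probs(X, sigma=1.0):
    """Row-normalised Gaussian affinities p_{i,j} (one fixed bandwidth sigma here;
    SNE itself tunes a separate sigma_i per point via a perplexity search)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    logits = -sq_dists / (2.0 * sigma ** 2)
    np.fill_diagonal(logits, -np.inf)           # a point is never its own neighbour
    P = np.exp(logits)
    return P / P.sum(axis=1, keepdims=True)     # row i is the distribution P_i

def sne_cost(X, Y, sigma=1.0, eps=1e-12):
    """C = sum_i KL(P_i || Q_i) between the data X and a low-dimensional map Y."""
    P = neighbour_probs(X, sigma=sigma)
    # q_{i,j} uses exp(-||y_i - y_j||^2), i.e. a Gaussian kernel with 2*sigma^2 = 1.
    Q = neighbour_probs(Y, sigma=np.sqrt(0.5))
    return float(np.sum(P * (np.log(P + eps) - np.log(Q + eps))))

# Toy usage: a random 2-D map of 10-D data (a poor embedding, so the cost is high).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
Y = rng.normal(size=(50, 2))
print(sne_cost(X, Y))
```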

If we fix the index $i$ for a moment and just look at a single point $i$, expanding out the notation we get:

$$D_{KL}(P_i\,\|\,Q_i) = \sum_j p_{i,j} \log\left(\frac{p_{i,j}}{q_{i,j}}\right).$$

This is where the intuition comes in: for each point $i$, the KL divergence will be high if points which are close in high-dimensional space (large $p_{i,j}$) are placed far apart in low-dimensional space (small $q_{i,j}$), but it puts a much smaller penalty on points that are far apart in high-dimensional space being placed close together in low dimensions. In this way the asymmetry of the KL divergence is actually beneficial!
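As a quick toy check (the numbers are chosen arbitrarily, just to illustrate the two cases), compare the contribution $p_{i,j}\log(p_{i,j}/q_{i,j})$ of a single pair in each situation:

```python
import numpy as np

# Contribution of one pair (i, j) to KL(P_i || Q_i): p * log(p / q).
def contrib(p, q):
    return p * np.log(p / q)

# Close in high-D (large p) but mapped far apart in low-D (small q): big penalty.
print(contrib(0.4, 0.01))   # ~ 1.48

# Far apart in high-D (small p) but mapped close together (large q): tiny effect.
# (Individual terms can even be negative; the full sum over j is always >= 0.)
print(contrib(0.01, 0.4))   # ~ -0.04
```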

Minimizing this cost therefore yields an embedding that preserves the local structure of the data well, exactly as the authors set out to do, with the KL divergence playing a pivotal role.

David Kozak

The K-L distance is also called the relative entropy. Books on information theory where it is discussed:

  1. Elements of Information Theory, Second Edition by Thomas Cover and Joy Thomas, Wiley 2006.

  2. Information Theory and Statistics by Solomon Kullback, Dover paperback 1997. A reprint of an earlier book (Wiley 1959).

  3. Statistical Implications of Turing's Formula by Zhiyi Zhang, Wiley 2017.

There is a great deal of useful information on this site! See also [Intuition on the Kullback-Leibler (KL) Divergence](http://stats.stackexchange.com/questions/188903/intuition-on-the-kullback-leibler-kl-divergence/189758#189758).
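For the simple discrete example the question asked about, here is a short sketch (the two distributions are arbitrary): it computes the relative entropy in nats and shows directly that swapping the arguments changes the value.

```python
import numpy as np

def kl(p, q):
    """Relative entropy D_KL(p || q) = sum_x p(x) log(p(x)/q(x)), in nats.
    Assumes p and q are strictly positive and each sums to 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

P = [0.9, 0.05, 0.05]     # an arbitrary peaked distribution on three outcomes
Q = [1/3, 1/3, 1/3]       # the uniform distribution on the same outcomes

print(kl(P, Q))   # ~ 0.70
print(kl(Q, P))   # ~ 0.93  -- different: the divergence is not symmetric
```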

Michael R. Chernick