
From my readings, I understand that:

  1. Mutual information $\mathit{(MI)}$ is a metric as it meets the triangle inequality, non-negativity, indiscernibility and symmetry criteria.
  2. The Kullback–Leibler divergence $\mathit{(D_{KL})}$ is not a metric as it does not obey the triangle inequality.

However, in one answer on Cross Validated (Information gain, mutual information and related measures) [the second answer], it is shown that mutual information and Kullback–Leibler divergence are equivalent. How can this be, given that $\mathit{MI}$ is a metric and $\mathit{D_{KL}}$ is not? I can only assume that I am missing something here.

Mari153
  • mutual information is not a metric. Variation of information is a metric – develarist Sep 11 '20 at 04:21
  • @develarist - thanks for this. This is very interesting as going by a Google search, it is clear that mutual information is treated as a metric by some data analysts. – Mari153 Sep 11 '20 at 05:51

1 Answer


Mutual information is not a metric. A metric $d$ satisfies the identity of indiscernibles: $d(x, y) = 0$ if and only if $x = y$. This is not true of mutual information, which behaves in the opposite manner: zero mutual information implies that two random variables are independent (as far from identical as you can get), while two identical random variables have maximal mutual information (as far from zero as you can get).

You're correct that KL divergence is not a metric. It's not symmetric and doesn't satisfy the triangle inequality.

Mutual information and KL divergence are not equivalent. However, the mutual information $I(X, Y)$ between random variables $X$ and $Y$ is given by the KL divergence between the joint distribution $p_{XY}$ and the product of the marginal distributions $p_X \otimes p_Y$ (what the joint distribution would be if $X$ and $Y$ were independent).

$$I(X, Y) = D_{KL}(p_{XY} \parallel p_X \otimes p_Y)$$
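
As a quick illustration of this identity (an added sketch, not part of the original answer), the snippet below computes $I(X, Y)$ for small discrete distributions by plugging the joint and the product of its marginals into a KL divergence routine; the distributions and helper names are invented for the example. For independent variables the result is (numerically) zero, and for identical variables it equals the entropy $H(X)$, the maximum possible value.

```python
# Illustrative sketch: I(X, Y) as the KL divergence between the joint
# distribution and the product of its marginals, for discrete distributions.
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in nats, summing only over cells where p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def mutual_information(p_xy):
    """I(X, Y) computed as D_KL(p_XY || product of marginals)."""
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of X, as a column
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of Y, as a row
    return kl_divergence(p_xy, p_x * p_y)  # broadcasting gives the product distribution

p_x = np.array([0.3, 0.7])
p_y = np.array([0.6, 0.4])

# Independent X and Y: the joint is the outer product, so I(X, Y) = 0.
print(mutual_information(np.outer(p_x, p_y)))  # ~0.0

# Identical X and Y: the joint is diagonal, so I(X, X) = H(X).
print(mutual_information(np.diag(p_x)))        # ~0.611 nats
print(-np.sum(p_x * np.log(p_x)))              # H(X), also ~0.611 nats
```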

Although mutual information is not itself a metric, there are metrics based on it. For example, the variation of information:

$$VI(X, Y) = H(X, Y) - I(X, Y) = H(X) + H(Y) - 2 I(X, Y)$$

where $H(X)$ and $H(Y)$ are the marginal entropies and $H(X, Y)$ is the joint entropy.
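
Continuing the same illustrative sketch (again not part of the original answer, and reusing the hypothetical `mutual_information` helper and the distributions `p_x`, `p_y` defined above), the snippet below evaluates both expressions for $VI(X, Y)$ and checks that they agree, that $VI$ is zero when $X = Y$ (the identity of indiscernibles), and that it equals $H(X) + H(Y)$ when $X$ and $Y$ are independent.

```python
def entropy(p):
    """Shannon entropy in nats, summing only over positive cells."""
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def variation_of_information(p_xy):
    """VI(X, Y) = H(X, Y) - I(X, Y) = H(X) + H(Y) - 2 I(X, Y)."""
    h_x = entropy(p_xy.sum(axis=1))    # H(X)
    h_y = entropy(p_xy.sum(axis=0))    # H(Y)
    h_xy = entropy(p_xy)               # H(X, Y)
    i_xy = mutual_information(p_xy)    # I(X, Y)
    assert np.isclose(h_xy - i_xy, h_x + h_y - 2 * i_xy)  # both forms agree
    return h_xy - i_xy

print(variation_of_information(np.diag(p_x)))        # ~0.0 when X = Y
print(variation_of_information(np.outer(p_x, p_y)))  # ~H(X) + H(Y) ≈ 1.284 when independent
```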

user20160
  • This is ***most*** interesting, as when I type the exact phrase "mutual information is a metric" into Google, I get something like "about 367,000 results". Some 103 references on Google Scholar also use the phrase. So clearly there is some interpretation in which MI is being treated as a metric. – Mari153 Sep 11 '20 at 05:46
  • @Mari153 By "metric", I mean a formal [distance metric](https://en.wikipedia.org/wiki/Metric_(mathematics)), satisfying the conditions you mentioned in the question (symmetry, triangle inequality, identity of indiscernibles). However, it's quite common to use the word "metric" more informally, to describe a way of measuring or quantifying something. I suspect this is probably what's happening in your Google results. – user20160 Sep 11 '20 at 06:04
  • Thanks for this. I was assuming the same. The misuse of terms makes understanding maths, statistics and data analysis all the more challenging. – Mari153 Sep 11 '20 at 07:09