83

Why and when should we use Mutual Information over statistical correlation measures such as Pearson's, Spearman's, or Kendall's tau?

SaZa

5 Answers

124

Let's consider one fundamental concept of (linear) correlation, covariance (which is Pearson's correlation coefficient "un-standardized"). For two discrete random variables $X$ and $Y$ with probability mass functions $p(x)$, $p(y)$ and joint pmf $p(x,y)$ we have

$$\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y) = \sum_{x,y}p(x,y)xy - \left(\sum_xp(x)x\right)\cdot \left(\sum_yp(y)y\right)$$

$$\Rightarrow \operatorname{Cov}(X,Y) = \sum_{x,y}\left[p(x,y)-p(x)p(y)\right]xy$$

The Mutual Information between the two is defined as

$$I(X,Y) = E\left (\ln \frac{p(x,y)}{p(x)p(y)}\right)=\sum_{x,y}p(x,y)\left[\ln p(x,y)-\ln p(x)p(y)\right]$$

Compare the two: each contains a point-wise "measure" of "the distance of the two rv's from independence", as expressed by the distance of the joint pmf from the product of the marginal pmfs: $\operatorname{Cov}(X,Y)$ has it as a difference of levels, while $I(X,Y)$ has it as a difference of logarithms.

And what do these measures do? In $\operatorname{Cov}(X,Y)$ they create a weighted sum of the product of the two random variables. In $I(X,Y)$ they create a weighted sum of their joint probabilities.

So with $\operatorname{Cov}(X,Y)$ we look at what non-independence does to their product, while in $I(X,Y)$ we look at what non-independence does to their joint probability distribution.

Conversely, $I(X,Y)$ is the average value of the logarithmic measure of distance from independence, while $\operatorname{Cov}(X,Y)$ is the weighted value of the levels-measure of distance from independence, weighted by the product of the two rv's.

So the two are not antagonistic; they are complementary, describing different aspects of the association between two random variables. One could comment that Mutual Information "is not concerned" with whether the association is linear or not, while Covariance may be zero and the variables may still be stochastically dependent. On the other hand, Covariance can be calculated directly from a data sample without the need to actually know the probability distributions involved (since it is an expression involving moments of the distribution), while Mutual Information requires knowledge of the distributions, whose estimation, if unknown, is a much more delicate and uncertain task compared to the estimation of Covariance.
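
Below is a minimal numerical sketch of the two formulas above (the joint pmf is made up purely for illustration, it is not from the question or the answer). It also illustrates the point about zero Covariance with dependence: here $Y=X^2$, so $\operatorname{Cov}(X,Y)=0$ while $I(X,Y)=\ln 2>0$.

```python
# Hypothetical joint pmf: X in {-1, 0, 1}, Y = X^2 in {0, 1}.
import numpy as np

x_vals = np.array([-1.0, 0.0, 1.0])
y_vals = np.array([0.0, 1.0])
p_xy = np.array([[0.00, 0.25],    # x = -1  ->  y = 1
                 [0.50, 0.00],    # x =  0  ->  y = 0
                 [0.00, 0.25]])   # x = +1  ->  y = 1

p_x = p_xy.sum(axis=1)            # marginal pmf of X
p_y = p_xy.sum(axis=0)            # marginal pmf of Y

# Cov(X,Y) = sum_{x,y} [p(x,y) - p(x)p(y)] * x * y
cov = sum((p_xy[i, j] - p_x[i] * p_y[j]) * x_vals[i] * y_vals[j]
          for i in range(len(x_vals)) for j in range(len(y_vals)))

# I(X,Y) = sum_{x,y} p(x,y) [ln p(x,y) - ln p(x)p(y)], skipping zero cells
mi = sum(p_xy[i, j] * (np.log(p_xy[i, j]) - np.log(p_x[i] * p_y[j]))
         for i in range(len(x_vals)) for j in range(len(y_vals))
         if p_xy[i, j] > 0)

print(f"Cov(X,Y) = {cov:.4f}")      # 0.0000
print(f"I(X,Y)   = {mi:.4f} nats")  # 0.6931 = ln 2
```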

Alecos Papadopoulos
  • @ Alecos Papadopoulos; Thanks for your comprehensive answer. – SaZa Jan 09 '14 at 09:25
  • I was asking myself the same question but I have not completely understood the answer. @Alecos Papadopoulos: I understood that the dependence measured is not the same, okay. **So for what kind of relations between X and Y should we prefer mutual information I(X,Y) rather than Cov(X,Y)?** I had a strange example recently where Y was almost linearly dependent on X (it was nearly a straight line in a scatter plot) and **Corr(X,Y) was equal to 0.87** whereas **I(X,Y) was equal to 0.45**. So are there clearly some cases where one indicator should be chosen over the other? Thanks for helping! – Gandhi91 Mar 25 '14 at 19:47
  • @Gandhi91 What was the entropy of $X$, $H(X)$ in this specific case? – Alecos Papadopoulos Mar 30 '14 at 00:59
  • This is a great and very clear answer. I was wondering if you have a readily available example where cov is 0, but pmi is not. – thang Jun 11 '17 at 23:54
  • @thang. Not really. One should be able to find an example where covariance is zero and at the same time have the joint distribution available, to calculate the mutual information (and the joint distribution would not be the product of the marginals, because we want the variables to not be independent). – Alecos Papadopoulos Jun 12 '17 at 19:25
  • "while Mutual Information requires knowledge of the distributions" ... this is not a distinguishing aspect of MI. Using Pearson's requires knowledge of a the presence of a Gaussian, which is a massive assumption, despite what is taught in stats classes. The onus of knowing the distribution does not go away with Pearson's, in fact it should be considered even higher (almost ALL processes will NOT be Gaussian). – Cybernetic Feb 17 '20 at 13:20
  • @Cybernetic Why do you think that Pearson's correlation coefficient is, or should be, Gaussian-specific? It is just covariance "relativized" by the geometric mean of the variances. – Alecos Papadopoulos Feb 17 '20 at 14:54
  • @AlecosPapadopoulos because such measures are used for tests, which rely on moments being available in the distribution. Correlations can be found for all kinds of spurious (non) associations, and unless a test can showcase its significance it is meaningless. No moments = no tests. Almost ALL real-world data is fat-tailed, meaning the CLT does NOT apply (despite what statisticians will try to teach), and moments are either ill-defined or non-existent. Fat tails don't converge to normality asymptotically. If you have no variance, you have no covariance. Must use an informational approach (MI). – Cybernetic Feb 17 '20 at 15:18
  • @Cybernetic These are issues unrelated to how a measure is defined. As for the famous fat tails, take a Cauchy and truncate it far-far into the tails. It will acquire all its moments, while for any real world sample, inference using truncated Cauchy, will be indistinguishable from inference based on untruncated Cauchy (plus we will have moment measures). Let's not forget that distributions with "infinite" support are just a mathematical convenience -and when it stops being convenient (i.e. when we lose the moments), we simply have to abandon it, since, after all, it was just for convenience. – Alecos Papadopoulos Feb 17 '20 at 15:51
  • @AlecosPapadopoulos what do you mean by point-wise "measure"? Can't we take the integral? – GENIVI-LEARNER Feb 18 '20 at 14:51
  • @AlecosPapadopoulos also, to double-check: by point-wise measure, you meant the deviation of each rv from the mean, right? – GENIVI-LEARNER Feb 18 '20 at 14:52
  • @GENIVI-LEARNER In the context of my answer "point-wise" means that deviations are measured point by point (i.e. value per value that the rv takes). And no, it is not deviation from the mean here; this is clearly stated in my answer. It is deviation from the Independence relation. – Alecos Papadopoulos Feb 18 '20 at 19:39
  • @AlecosPapadopoulos makes lot more sense. Thanks – GENIVI-LEARNER Feb 18 '20 at 20:24
  • @AlecosPapadopoulos these are issues entirely related to how the measure is defined. The definition of Pearson's correlation is based on covariance, which uses the mean in its formula. The presence of this moment is well-defined only for the Gaussian. The only reason statisticians think normality is not a strict requirement is because they think deviations from Gaussian are still Gaussian. They are not. Read up on the statistical consequences of fat tails. Statisticians don't understand probability. – Cybernetic Feb 19 '20 at 14:56
  • @Cybernetic "The mean is well defined only for the Gaussian"? Last time I checked, there is a really large number of distributions that have a finite mean. And as I already wrote, an otherwise inconsequential truncation allows us to have fat-tail/heavy-tail/long-tail/subexponential distributions together with finite moments (if they are missing in the first place). I understand your criticism for those statisticians that still cling myopically to the "Normal world", but this is a phenomenon of professional inertia. Finally, as my contribution to bold claims, nobody understands probability. – Alecos Papadopoulos Feb 19 '20 at 17:47
12

Here's an example.

In these two plots the correlation coefficient is zero. But we can get high shared mutual information even when the correlation is zero.

In the first, I see that if I have a high or low value of X then I'm likely to get a high value of Y, but if the value of X is moderate then I have a low value of Y. So in the first plot, X and Y share mutual information. In the second plot, X tells me nothing about Y.

[Figure: Mutual Information vs Correlation]
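
To reproduce the flavour of this example numerically, here is a hedged sketch (the data-generating process and the crude histogram-based MI estimate are my own illustrative choices, not the answerer's): Y depends on X through a symmetric, roughly quadratic function, so the Pearson correlation is near zero while the estimated mutual information is clearly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x**2 + rng.normal(scale=0.05, size=x.size)  # high Y at extreme X, low Y at moderate X

corr = np.corrcoef(x, y)[0, 1]                  # close to 0 by symmetry

# crude plug-in MI estimate from a 2-D histogram (in nats)
counts, _, _ = np.histogram2d(x, y, bins=20)
p_xy = counts / counts.sum()
p_x = p_xy.sum(axis=1, keepdims=True)
p_y = p_xy.sum(axis=0, keepdims=True)
nz = p_xy > 0
mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))

print(f"Pearson r ~ {corr:.3f}, estimated I(X;Y) ~ {mi:.2f} nats")
```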

dennislendrem
9

Mutual information is a distance between two probability distributions. Correlation is a linear distance between two random variables.

You can have a mutual information between any two probabilities defined for a set of symbols, while you cannot have a correlation between symbols that cannot naturally be mapped into a R^N space.

On the other hand, mutual information does not make assumptions about certain properties of the variables. If you are working with variables that are smooth, correlation may tell you more about them, for instance whether their relationship is monotonic.

If you have some prior information, then you may be able to switch from one to the other; in medical records you can map the symbols "has genotype A" and "does not have genotype A" into the values 1 and 0 and see if this has some form of correlation with one sickness or another. Similarly, you can take a variable that is continuous (e.g. salary), convert it into discrete categories, and compute the mutual information between those categories and another set of symbols.
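
As a small, hypothetical sketch of the two conversions described above (all data below are simulated just to show the mechanics; scikit-learn's `mutual_info_score` is used for the discrete MI):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(1)
n = 500

# "has genotype A" coded as 1/0, correlated with a 0/1 sickness indicator
genotype_a = rng.integers(0, 2, n)
sick = (((genotype_a == 1) & (rng.random(n) < 0.6)) | (rng.random(n) < 0.2)).astype(int)
r = np.corrcoef(genotype_a, sick)[0, 1]

# a continuous salary binned into quartile categories, then MI with sickness
salary = rng.lognormal(mean=10, sigma=0.5, size=n)
salary_cat = np.digitize(salary, np.quantile(salary, [0.25, 0.5, 0.75]))
mi = mutual_info_score(salary_cat, sick)        # in nats

print(f"r(genotype, sick) ~ {r:.2f}, I(salary category; sick) ~ {mi:.3f} nats")
```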

  • Correlation isn't a linear function. Should it say that correlation is a measure of the linear relationship between random variables? – Matthew Gunn Oct 20 '16 at 09:33
  • I think this: "You can have a mutual information between any two probabilities defined for a set of symbols, while you cannot have a correlation between symbols that cannot naturally be mapped into a R^N space" is probably the key. Corr doesn't make sense if you don't have a complete random variable; however, pmi makes sense even with just the pdf and sigma (the space). This is why in many applications where RVs don't make sense (e.g. NLP), pmi is used. – thang Jun 12 '17 at 00:02
5

Although both of them are a measure of the relationship between features, MI is more general than the correlation coefficient (CC), since the CC can only take into account linear relationships, while MI can also handle non-linear relationships.

Hossein9
  • That's not true. The Pearson correlation coefficient assumes normality and linearity of two random variables; alternatives like the non-parametric Spearman's do not. There, only monotonicity between the two rvs is assumed. – meow Aug 02 '19 at 17:01
1

Mutual Information (MI) uses the concept of entropy to specify how much common information there is in two data samples $X$ and $Y$ with distribution functions $p_X(x)$ and $p_Y(y)$. Considering this interpretation of MI: $$I(X;Y) = H(X) + H(Y) - H(X,Y)$$ we see that the last term captures the dependence between the variables. If they are independent, the MI is zero; if $X$ and $Y$ fully determine one another, the MI equals the entropy of $X$ or $Y$. In contrast, the covariance measures only the distance of every data sample $(x,y)$ from the average $(\mu_X, \mu_Y)$. Therefore, Cov captures only one part of what MI captures. Another difference is the extra information that Cov delivers through its sign; this type of knowledge cannot be extracted from MI because of the log function.
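
A minimal sketch of the entropy identity quoted above, with a made-up joint pmf (entropies in nats):

```python
import numpy as np

p_xy = np.array([[0.3, 0.1],
                 [0.1, 0.5]])   # hypothetical joint pmf of (X, Y)
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):
    """Shannon entropy in nats, ignoring zero-probability cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

mi = H(p_x) + H(p_y) - H(p_xy.ravel())   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(f"I(X;Y) = {mi:.4f} nats")         # 0 only if p_xy factors into p_x * p_y
```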