I've seen a couple of talks by non-statisticians where they seem to reinvent correlation measures using mutual information rather than regression (or equivalent or closely related statistical tests).
I take it there's a good reason statisticians don't take this approach. My layman's understanding is that estimators of entropy / mutual information tend to be problematic and unstable. I assume statistical power suffers as a result; the speakers try to get around this by claiming that they're not working in a parametric testing framework. Usually this kind of work doesn't bother with power calculations, or even confidence/credible intervals.
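To make the instability concrete, here's a toy sketch of my own (not from any of the talks) showing what I mean: the naive plug-in (histogram) estimator of mutual information is biased upward even when the variables are independent, so a positive MI estimate by itself isn't evidence of association without some null calibration. The bin count and sample sizes are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_mi(x, y, bins=20):
    """Naive plug-in MI estimate (in nats) from a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                  # joint probabilities
    px = pxy.sum(axis=1, keepdims=True)        # marginal of x
    py = pxy.sum(axis=0, keepdims=True)        # marginal of y
    mask = pxy > 0                             # avoid log(0)
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

for n in (100, 1000, 10000):
    # X and Y are independent here, so the true MI is exactly 0,
    # yet the plug-in estimate is systematically positive at small n.
    mis = [plugin_mi(rng.normal(size=n), rng.normal(size=n)) for _ in range(200)]
    print(f"n={n:6d}  mean plug-in MI = {np.mean(mis):.3f} nats (true value: 0)")
```

The bias shrinks as n grows, which is exactly why I'm asking whether very large datasets blunt the objection.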
But to take a devil's advocate position, is slow convergence that big a deal when datasets are extremely large? Also, sometimes these methods seem to "work" in the sense that the associations are validated by follow-up studies. What's the best critique of using mutual information as a measure of association, and why isn't it widely used in statistical practice?
edit: Also, are there any good papers that cover these issues?