I have two vectors $x, y \in \mathbb{R}^n.$ By counting value occurrences and dividing by the vector length I can compute $p(x)$ and $p(y),$ but I have no information on the joint density. How can I calculate the mutual information in this case?
-
What do you mean by $p(x)$ and $p(y)$? What are you counting? How exactly did these vectors arise--what are they modeling? – whuber Jul 20 '12 at 12:41
-
Imagine I have two vectors $x = \begin{pmatrix} -1 \\ -2 \\ -3 \\ 1 \\ 2 \\ 3 \end{pmatrix}$ and $y = \begin{pmatrix} 1 \\ 4 \\ 9 \\ 1 \\ 4 \\ 9 \end{pmatrix}.$ If I were to compute the correlation, I would evaluate $\frac{E\left[(x-\mu_x)(y-\mu_y)\right]}{\sigma_x \sigma_y},$ which I can readily compute from the vectors $x,y$ to obtain $\operatorname{corr}(x,y) = 0.$ However, for mutual information I need $p(x),$ which I have ($p(x=-1) = p(x=-2) = p(x=-3) = p(x=1) = p(x=2) = p(x=3) = 1/6$), and $p(y),$ which I also have ($p(y=1) = p(y=4) = p(y=9) = 1/3$)... – user1137731 Jul 20 '12 at 13:22
-
...both inferred from the empirical frequencies by counting occurrences and taking the vector length into account. However, I then have no information on the joint density $p(x,y).$ How could I calculate mutual information then? – user1137731 Jul 20 '12 at 13:22
-
You seem to be confusing vectors, random variables, and paired observations. Mathematically they may look similar, but they function quite differently. In this case it seems you intend your "vectors" really to be paired observations, whence your "$p$" is really an *observed frequency,* not a probability. But because you have *paired* data, you can just as readily estimate the joint density: it's the empirical joint density of $(x,y)$. – whuber Jul 20 '12 at 13:25
-
Indeed, I would like to use the observed frequencies to model $p(x)$ and $p(y).$ For my example above, could you please show me, as an example, how to compute it? – user1137731 Jul 20 '12 at 13:37
1 Answer
The Mutual Information of two discrete random variables $X$ and $Y$ taking values in $R_X$ and $R_Y$ respectively is the difference between the expectation of $\log p(x,y)$ (the logarithm of the joint probability of $(X,Y)$) and the expectation of $\log\left(p(x)p(y)\right)$ (which would be the joint probability for independent variables having marginal probabilities of $p(x)$ and $p(y)$).
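Written out, this difference of expectations is

$$I(X;Y) \;=\; E\left[\log p(X,Y)\right] - E\left[\log\bigl(p(X)\,p(Y)\bigr)\right] \;=\; \sum_{x\in R_X,\, y\in R_Y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}.$$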
When the vectors constitute an iid sample of $(X,Y)$, we can compute the mutual information of their joint empirical density. This is just the observed frequency: if a particular combination of values $(x,y)$ occurs $k(x,y)$ times in the dataset out of $n$ total occurrences, the empirical density $\hat{p}(x,y)$ is just the ratio $k(x,y)/n$.
To compute expectations with respect to the empirical density, let's introduce some notation. Let $R$ ("rows") be the set of distinct observed values of $X$ and $C$ ("columns") the set of distinct observed values of $Y$. For $x\in R$ and $y\in C$, $k(x,*) = \sum_{y\in C}k(x,y)$ is the row sum, counting all elements of the dataset whose first component is $x$. Likewise, $k(*,y) = \sum_{x\in R}k(x,y)$ is the column sum. These determine the marginal densities. Notice that the sum of all the $k(x,y)$, the sum of all the $k(x,*)$, and the sum of all the $k(*,y)$ all count the elements of the dataset, whence they are all equal to $n$. The mutual information equals
$$\eqalign{ &\sum_{x\in R, y\in C} \frac{k(x,y)}{n} \left(\log \frac{k(x,y)}{n} - \log \left(\frac{k(x,*)}{n} \frac{k(*,y)}{n}\right)\right)\\ =&\frac{1}{n}\sum_{x\in R, y\in C}k(x,y)\left(\log k(x,y) - \log(n)\right)\\ &- \frac{1}{n}\sum_{x\in R}k(x,*)\left(\log k(x,*) - \log(n)\right) \\ &- \frac{1}{n}\sum_{y\in C}k(*,y)\left(\log k(*,y) - \log(n)\right) \\ =&\log(n) + \\ & \frac{1}{n} \left( \sum_{x\in R, y\in C}k(x,y)\log k(x,y) - \sum_{x\in R}k(x,*)\log k(x,*) - \sum_{y\in C}k(*,y)\log k(*,y) \right). }$$
The first equality just exploits properties of logarithms, while the last follows because the $k(x,y)$, the $k(x,*)$, and the $k(*,y)$ each sum to $n$.
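As an illustration (not part of the original answer), here is a minimal Python sketch of that count-based formula; the function name and the use of `collections.Counter` are my own choices:

```python
from collections import Counter
from math import log

def empirical_mutual_information(x, y):
    """Mutual information (in nats) of the empirical joint distribution of paired data."""
    n = len(x)
    assert n == len(y), "x and y must be paired observations of equal length"
    joint = Counter(zip(x, y))   # k(x, y): counts of each observed pair
    rows = Counter(x)            # k(x, *): row sums
    cols = Counter(y)            # k(*, y): column sums
    # Cells with zero count never appear in the Counters, which matches the
    # convention 0 * log(0) = 0.
    total = sum(k * log(k) for k in joint.values())
    total -= sum(k * log(k) for k in rows.values())
    total -= sum(k * log(k) for k in cols.values())
    return log(n) + total / n
```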
In the example, $n=6$, $R=\{-1,-2,-3,1,2,3\}$, $C=\{1,4,9\}$, all the $k(x,*)=1$, all the $k(*,y)=2$, and all the $k(x,y)$ are either $0$ or $1$. The mutual information equals
$$\log(6) + \frac{1}{6}\left(6 \times (1\times \log(1)) - 6\times(1\times \log(1)) - 3\times(2\times \log(2)) \right) = \log(3).$$
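For what it's worth, evaluating the sketch above on the example vectors reproduces this value:

```python
x = [-1, -2, -3, 1, 2, 3]
y = [1, 4, 9, 1, 4, 9]
print(empirical_mutual_information(x, y))  # 1.0986... = log(3)
```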

-
(+1) It seems the important point for addressing the OP's concern may be to emphasize the necessity of observing $(X,Y)$ *pairs*. Observing *only* marginal counts tells you next to nothing about the (empirical) mutual information. – cardinal Jul 20 '12 at 14:34
-
Indeed, based on whuber's answer to my question on how to infer $p(x,y)$, we look at counts of pairs $(x_i,y_i) \in \mathbb{R}^2,$ $i \in \{1,\ldots,n\};$ i.e., the joint density is inferred the same way we count event occurrences for the marginal densities $p(x), p(y).$ Thanks whuber and cardinal! :) – user1137731 Jul 20 '12 at 20:39
-
Note that the plug-in estimator (given by whuber) can be severely biased. It is asymptotically unbiased, and it is the ML estimate, but it can be quite off when only a handful of observations are given. – Memming Jul 21 '12 at 14:51
-
@Memming That is a good point--which is why my answer does not advocate using the mutual information of the EDF as an estimator: it limits itself to describing the computation, as requested in the question, without suggesting that this could or should be used to estimate the mutual information of the underlying distribution. However, if you happen to know of any reasonable estimators, especially for smallish samples, then I expect a brief account of them in a separate answer (if you would care to post one) would be very welcome and well received. – whuber Jul 21 '12 at 16:26
-
I have another small query: suppose my vector entries are not integer-valued but instead drawn from the real numbers. Would this change anything in whuber's estimator? Thank you. – user1137731 Jul 23 '12 at 13:45
-
User1137731, I would caution you, adding my voice to that of @Memming, not to use this as an estimator of mutual information unless you have a *lot* of data, and even then you probably should do some careful smoothing and binning first. As far as being a computation of MI of the data goes, there's nothing in my analysis that required integer values: it works regardless. – whuber Jul 23 '12 at 14:17
-
Thanks, that's what I thought; it may just be the case that we do not encounter many duplicate terms in the sum but genuinely different ones. The vector I am actually using has dimension 125, which I hope should be sufficient. – user1137731 Jul 23 '12 at 15:20
-
There is a paper by Hausser and Strimmer, 'Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks' (http://jmlr.csail.mit.edu/papers/volume10/hausser09a/hausser09a.pdf), with an associated R package 'entropy' (http://cran.r-project.org/web/packages/entropy/index.html). But this approach seems to expect count data rather than continuous data for the vectors $x,y.$ – user1137731 Jul 26 '12 at 15:15
-
You say that $k(x,y)$ is zero or 1, but the formula shows $6\times(1\times\log(1))$. I am trying to figure out what to put when $k(x,y)$ is zero. Just zero for that term of the sum? Does your refactoring of the original equation have any assumptions, or can I use it to calculate MI? Thanks! – beroe Apr 18 '14 at 00:14
-
@beroe The expression summarizes six terms, which is where the $6$s come from. Yes, when $k(x,y)=0,$ by convention (or by taking limits) we set $k(x,y)\log(k(x,y))=0$. The algebraic manipulation makes no assumptions. – whuber Apr 18 '14 at 00:28
-
This is many years old now, but I wonder if anyone here has a sense of the best way to do this with continuous values (not finite, not discrete). It seems we first would need to separately estimate $p(x)$, $p(y)$, and $p(x,y)$? Using either histograms, or KDEs, or some other fancier method? Is there any simpler way? Basically, I'm wondering if there is a way to measure "correlation of two variables" similar to Spearman's rank correlation, but without assuming one variable is a monotone function of the other. What's the best way to do that? – Neil Traft Jul 15 '21 at 21:54
-
Hmm, looks like [this question](https://stats.stackexchange.com/questions/391123/recommended-mutual-information-estimator-for-continuous-variable) asks exactly what I'm looking for. Although there is no definitive answer, the comments suggest that I am in the right ballpark: basically, you have to estimate the PDFs somehow. – Neil Traft Jul 15 '21 at 22:20
-
@Neil re "any simpler way:" see https://en.wikipedia.org/wiki/Distance_correlation for one approach. – whuber Jul 15 '21 at 22:31