
Setting: Multi-class classification problem with discrete nominal features.

There are many references that mention using IG (Information Gain) and MI (Mutual Information) as measures of feature relevance for filter-based feature selection. However, from the information-theoretic viewpoint it is not completely clear to me what the difference between the two is (if there is any):

$$\mathrm{IG}(X,Y) \stackrel{?}{=} \mathrm{MI}(X,Y) = H(X)-H(X|Y) = H(Y)-H(Y|X) = H(X)+H(Y)-H(X,Y) = \dots$$
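
For concreteness, here is a minimal sketch (plain Python with made-up toy labels, not the actual data) of computing both quantities from raw nominal samples; on such data they always come out equal:

```python
from collections import Counter
from math import log2

# Made-up nominal feature values and class labels
x = ["a", "a", "b", "b", "b", "c", "c", "c", "c", "a"]
y = ["+", "-", "+", "+", "-", "-", "-", "+", "-", "+"]
n = len(x)

def entropy(labels):
    """Empirical entropy (in bits) of a list of nominal labels."""
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

# Information gain: IG = H(Y) - H(Y|X), with H(Y|X) a weighted average over the values of X
h_y = entropy(y)
h_y_given_x = sum(
    (x.count(v) / n) * entropy([yi for xi, yi in zip(x, y) if xi == v])
    for v in set(x)
)
ig = h_y - h_y_given_x

# Mutual information from the empirical joint distribution
pxy = Counter(zip(x, y))
px, py = Counter(x), Counter(y)
mi = sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
         for (a, b), c in pxy.items())

print(ig, mi)  # the two numbers coincide (up to floating-point error)
```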


ANSWER:

Please see the accepted answer below. There is no indication that MI and IG have different meanings in the context of information theory - they are the same measure.

Regarding WEKA and the differing results: I had used the Feature Selection package, which always assumed continuous attributes, so Weka's discretization was applied before computing Information Gain. Running IG directly in WEKA without discretization produced results equivalent to MI.
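
For illustration, a small sketch of the discretization effect (plain Python/scikit-learn rather than WEKA itself; the feature, class, and binning are all made up): coarsening an attribute before scoring changes the number that comes out, which is consistent with the discrepancy described above.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # MI between two discrete labelings (in nats)

rng = np.random.default_rng(0)

# Hypothetical nominal feature with 4 values; the class-1 probability depends on the value
x = rng.integers(0, 4, size=5000)
p1 = np.array([0.1, 0.4, 0.6, 0.9])[x]
y = (rng.random(5000) < p1).astype(int)

score_raw = mutual_info_score(y, x)        # score on the original 4 nominal values

# Simulate a discretization step that lumps the values into 2 coarse bins
x_coarse = (x >= 2).astype(int)
score_coarse = mutual_info_score(y, x_coarse)

print(score_raw, score_coarse)  # the coarse score is smaller: binning changed the result
```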

Oleg Shirokikh

1 Answer


They are exactly the same.

Let $X$ be a nominal feature with $k$ distinct values and let $C$ be your target class with $m$ classes. Then:

\begin{align}
\mbox{MI} &= \sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log \frac{P(x_i,c_j)}{P(x_i)P(c_j)}\\
&= -\sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log P(c_j) + \sum_{i=1}^k\sum_{j=1}^m P(c_j|x_i)P(x_i)\log P(c_j|x_i)\\
&= -\sum_{j=1}^m P(c_j)\log P(c_j) + \sum_{i=1}^k P(x_i)\sum_{j=1}^m P(c_j|x_i)\log P(c_j|x_i)\\
&= H(C) - H(C|X) = \mbox{IG}
\end{align}
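
As a quick numerical sanity check, here is a minimal Python sketch (the joint probabilities are made up) that evaluates both sides of this identity:

```python
from math import log2

# Made-up joint distribution P(x_i, c_j): 3 feature values (rows) x 2 classes (columns)
joint = [[0.20, 0.05],
         [0.10, 0.25],
         [0.15, 0.25]]
px = [sum(row) for row in joint]            # marginal P(x_i)
pc = [sum(col) for col in zip(*joint)]      # marginal P(c_j)

# Left-hand side: MI = sum_ij P(x_i,c_j) log P(x_i,c_j) / (P(x_i) P(c_j))
mi = sum(p * log2(p / (px[i] * pc[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Right-hand side: IG = H(C) - H(C|X)
h_c = -sum(p * log2(p) for p in pc if p > 0)
h_c_given_x = -sum(px[i] * sum((p / px[i]) * log2(p / px[i]) for p in row if p > 0)
                   for i, row in enumerate(joint))
ig = h_c - h_c_given_x

print(mi, ig)  # both are ~0.128 bits: identical up to floating-point error
```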

This is a case of inconsistent naming. You might want to have a look at this question too.

Simone
  • Thanks, I was thinking along the same lines - as I wrote, the math gives exact equivalence. Are you familiar with Weka? I'm wondering where the discrepancy is coming from. – Oleg Shirokikh Sep 24 '14 at 23:37
  • I guess the discrepancy originates with the early articles on decision trees, ID3 and C4.5: it was appealing to maximize the information gained about the class from a split on a feature, and the term then came into common use in machine learning. However, this is just my guess. – Simone Sep 24 '14 at 23:49
  • That is good to know! It seems we can conclude that it is just a matter of ambiguous naming. The only open question is why Weka produces different results. – Oleg Shirokikh Sep 25 '14 at 00:28
  • What are you comparing Weka with? Are all features categorical with no missing values? – Simone Sep 25 '14 at 00:32
  • Yes, all categorical, clean. I'm just taking some toy examples so I can verify the results by hand as well - and they do not match Weka's. Also, I was able to find a couple of sources (not sure how reputable they are) that mention IG and MI as different measures: http://www.slideshare.net/guo_dong/feature-selection http://doras.dcu.ie/16194/2/A_Study_on_Mutual_Information-based_Feature_Selection_for_Text_Categorization.pdf http://link.springer.com/chapter/10.1007/978-3-642-29075-6_11 http://jmlr.org/proceedings/papers/v10/sanasam10a/sanasam10a.pdf – Oleg Shirokikh Sep 25 '14 at 00:58
  • I just figured out that in these references the authors do not actually use Mutual Information but something called pointwise mutual information (see the short sketch after these comments): http://en.wikipedia.org/wiki/Pointwise_mutual_information. Maybe try to post your examples here along with the corresponding Weka output. – Simone Sep 25 '14 at 01:46
  • Thanks, Simone - it's resolved. Please see the update in the question. – Oleg Shirokikh Sep 26 '14 at 19:20
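
To make the pointwise-MI distinction from the last comments concrete: PMI is defined per (value, class) cell, while MI is its expectation under the joint distribution, so a paper that scores a feature by its PMI with a class is computing a different quantity than MI. A minimal sketch, reusing the made-up joint distribution from the check in the answer above:

```python
from math import log2

# Same made-up joint distribution as in the earlier sanity check
joint = [[0.20, 0.05],
         [0.10, 0.25],
         [0.15, 0.25]]
px = [sum(row) for row in joint]
pc = [sum(col) for col in zip(*joint)]

# Pointwise MI: one score per (value, class) cell, can be negative
pmi = [[log2(joint[i][j] / (px[i] * pc[j])) for j in range(2)] for i in range(3)]

# MI: the expectation of PMI under the joint distribution, always non-negative
mi = sum(joint[i][j] * pmi[i][j] for i in range(3) for j in range(2))

print(pmi)  # 3x2 table of per-cell scores (some negative)
print(mi)   # a single number, ~0.128 bits
```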