
Setting: Multi-class classification problem with discrete nominal features.

There are many references that mention using IG (Information Gain) and MI (Mutual Information) as measures of feature relevance for filter-based feature selection. However, from the information-theoretic viewpoint it is not completely clear to me what the difference between the two is (if there is any):

$$\mathrm{IG}(X,Y) \stackrel{?}{=} \mathrm{MI}(X,Y) = H(X)-H(X|Y) = H(Y)-H(Y|X) = H(X)+H(Y)-H(X,Y) = \dots$$
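
For concreteness, here is a minimal sketch (plain Python with made-up toy labels, not the actual data) of computing both quantities from raw nominal samples; on such data they always come out equal:

```python
from collections import Counter
from math import log2

# Made-up nominal feature values and class labels
x = ["a", "a", "b", "b", "b", "c", "c", "c", "c", "a"]
y = ["+", "-", "+", "+", "-", "-", "-", "+", "-", "+"]
n = len(x)

def entropy(labels):
    """Empirical entropy (in bits) of a list of nominal labels."""
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

# Information gain: IG = H(Y) - H(Y|X), with H(Y|X) a weighted average over the values of X
h_y = entropy(y)
h_y_given_x = sum(
    (x.count(v) / n) * entropy([yi for xi, yi in zip(x, y) if xi == v])
    for v in set(x)
)
ig = h_y - h_y_given_x

# Mutual information from the empirical joint distribution
pxy = Counter(zip(x, y))
px, py = Counter(x), Counter(y)
mi = sum((c / n) * log2((c / n) / ((px[a] / n) * (py[b] / n)))
         for (a, b), c in pxy.items())

print(ig, mi)  # the two numbers coincide (up to floating-point error)
```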


ANSWER:

Please see the accepted answer below. There is no indication that MI and IG have different meanings in the context of information theory - they are the same measure.

Regarding WEKA and the differing results: I had used the Feature Selection package, which always assumed continuous attributes, so Weka's discretization was applied before computing Information Gain. Running IG directly in WEKA without discretization produced results equivalent to MI.
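
For illustration, a small sketch of the discretization effect (plain Python/scikit-learn rather than WEKA itself; the feature, class, and binning are all made up): coarsening an attribute before scoring changes the number that comes out, which is consistent with the discrepancy described above.

```python
import numpy as np
from sklearn.metrics import mutual_info_score  # MI between two discrete labelings (in nats)

rng = np.random.default_rng(0)

# Hypothetical nominal feature with 4 values; the class-1 probability depends on the value
x = rng.integers(0, 4, size=5000)
p1 = np.array([0.1, 0.4, 0.6, 0.9])[x]
y = (rng.random(5000) < p1).astype(int)

score_raw = mutual_info_score(y, x)        # score on the original 4 nominal values

# Simulate a discretization step that lumps the values into 2 coarse bins
x_coarse = (x >= 2).astype(int)
score_coarse = mutual_info_score(y, x_coarse)

print(score_raw, score_coarse)  # the coarse score is smaller: binning changed the result
```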

Oleg Shirokikh

1 Answer


They are exactly the same.

Let $X$ be a nominal feature with $k$ distinct values and let $C$ be your target class with $m$ classes. Then:

\begin{align}
\mbox{MI} &= \sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log \frac{P(x_i,c_j)}{P(x_i)P(c_j)}\\
&= -\sum_{i=1}^k\sum_{j=1}^m P(x_i,c_j)\log P(c_j) + \sum_{i=1}^k\sum_{j=1}^m P(c_j|x_i)P(x_i)\log P(c_j|x_i)\\
&= -\sum_{j=1}^m P(c_j)\log P(c_j) + \sum_{i=1}^k P(x_i)\sum_{j=1}^m P(c_j|x_i)\log P(c_j|x_i)\\
&= H(C) - H(C|X) = \mbox{IG}
\end{align}
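
As a quick numerical sanity check, here is a minimal Python sketch (the joint probabilities are made up) that evaluates both sides of this identity:

```python
from math import log2

# Made-up joint distribution P(x_i, c_j): 3 feature values (rows) x 2 classes (columns)
joint = [[0.20, 0.05],
         [0.10, 0.25],
         [0.15, 0.25]]
px = [sum(row) for row in joint]            # marginal P(x_i)
pc = [sum(col) for col in zip(*joint)]      # marginal P(c_j)

# Left-hand side: MI = sum_ij P(x_i,c_j) log P(x_i,c_j) / (P(x_i) P(c_j))
mi = sum(p * log2(p / (px[i] * pc[j]))
         for i, row in enumerate(joint)
         for j, p in enumerate(row) if p > 0)

# Right-hand side: IG = H(C) - H(C|X)
h_c = -sum(p * log2(p) for p in pc if p > 0)
h_c_given_x = -sum(px[i] * sum((p / px[i]) * log2(p / px[i]) for p in row if p > 0)
                   for i, row in enumerate(joint))
ig = h_c - h_c_given_x

print(mi, ig)  # both are ~0.128 bits: identical up to floating-point error
```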

This is a case of inconsistent naming. You might want to have a look at this question too.

Simone
  • Thanks, I was thinking along the same lines - as I wrote, the math gives exact equivalence. Are you familiar with Weka? I'm wondering where the discrepancy is coming from. – Oleg Shirokikh Sep 24 '14 at 23:37
  • I guess the discrepancy originates with the early articles on decision trees, ID3 and C4.5: it was appealing to maximize the information gained about the class from a split on a feature, and the term then came into common use in machine learning. However, this is just my guess. – Simone Sep 24 '14 at 23:49
  • That is good to know! It seems we can conclude that it is just a matter of ambiguous naming. The only open question is why Weka produces different results. – Oleg Shirokikh Sep 25 '14 at 00:28
  • What are you comparing Weka with? Are all features categorical with no missing values? – Simone Sep 25 '14 at 00:32
  • Yes, all categorical, clean. I'm just taking some toy examples so I can verify the results by hand as well - and they do not match Weka's. Also, I was able to find a couple of sources (not sure how reputable they are) that mention IG and MI as different measures: http://www.slideshare.net/guo_dong/feature-selection http://doras.dcu.ie/16194/2/A_Study_on_Mutual_Information-based_Feature_Selection_for_Text_Categorization.pdf http://link.springer.com/chapter/10.1007/978-3-642-29075-6_11 http://jmlr.org/proceedings/papers/v10/sanasam10a/sanasam10a.pdf – Oleg Shirokikh Sep 25 '14 at 00:58
  • I just figured out that in these references the authors do not actually use Mutual Information but something called pointwise mutual information (see the short sketch after these comments): http://en.wikipedia.org/wiki/Pointwise_mutual_information. Maybe try to post your examples here along with the corresponding Weka output. – Simone Sep 25 '14 at 01:46
  • Thanks, Simone - it's resolved. Please see the update in the question. – Oleg Shirokikh Sep 26 '14 at 19:20
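
To make the pointwise-MI distinction from the last comments concrete: PMI is defined per (value, class) cell, while MI is its expectation under the joint distribution, so a paper that scores a feature by its PMI with a class is computing a different quantity than MI. A minimal sketch, reusing the made-up joint distribution from the check in the answer above:

```python
from math import log2

# Same made-up joint distribution as in the earlier sanity check
joint = [[0.20, 0.05],
         [0.10, 0.25],
         [0.15, 0.25]]
px = [sum(row) for row in joint]
pc = [sum(col) for col in zip(*joint)]

# Pointwise MI: one score per (value, class) cell, can be negative
pmi = [[log2(joint[i][j] / (px[i] * pc[j])) for j in range(2)] for i in range(3)]

# MI: the expectation of PMI under the joint distribution, always non-negative
mi = sum(joint[i][j] * pmi[i][j] for i in range(3) for j in range(2))

print(pmi)  # 3x2 table of per-cell scores (some negative)
print(mi)   # a single number, ~0.128 bits
```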