
It looks like feature selection can be done with mutual information. Mutual information is related to conditional entropy by this equation:

$I(X;Y) = H(X) - H(X \mid Y)$

Can we use conditional entropy for feature selection by computing the conditional entropy between each feature and the output variable, sorting the results, and picking the features with the smallest conditional entropy?
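To make the proposal concrete, here is a minimal sketch of the procedure I have in mind, assuming discrete-valued features; the helper names (`conditional_entropy`, `rank_features`) are my own, not from any library:

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (in bits) of a discrete sample."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(x, y):
    """Empirical H(X | Y): entropy of x within each y-group, weighted by P(y)."""
    n = len(y)
    h = 0.0
    for y_val in set(y):
        x_given_y = [xi for xi, yi in zip(x, y) if yi == y_val]
        h += (len(x_given_y) / n) * entropy(x_given_y)
    return h

def rank_features(features, y):
    """Sort features by H(X | Y), smallest first, as proposed above."""
    scores = {name: conditional_entropy(x, y) for name, x in features.items()}
    return sorted(scores.items(), key=lambda kv: kv[1])
```

For example, `rank_features({"f1": x1, "f2": x2}, y)` would return the feature names paired with their conditional entropies, smallest first.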

user442920
  • This requires you to model $P(X\mid Y)$, even though you’re typically modeling the other direction. How would you compute the conditional entropy? Mutual info is easier because you could instead compute it as $H(Y) - H(Y \mid X)$ by including or removing the feature. – Arya McCarthy Apr 06 '21 at 23:18

2 Answers


Expanding my comment: there’s a computational reason you mightn’t want this, and there’s a methodological one.

This requires you to define the distribution $P(X\mid Y)$, even though you’re typically modeling the other direction. How would you compute this conditional entropy, especially in discriminative models?

Mutual info is easier because you could instead compute it as $H(Y) - H(Y \mid X)$ by including or removing the feature. Plus, you only need to compute $H(Y)$ once. It can be reused amongst all features—or just skipped because it’s a constant.
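A minimal sketch of that direction, assuming discrete features (the function names here are illustrative, not from a particular library): $H(Y)$ is computed once and reused across every feature, so ranking by mutual information reduces to ranking by $-H(Y \mid X)$.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Empirical Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def cond_entropy(y, x):
    """H(Y | X): uncertainty left in y after observing x."""
    n = len(x)
    return sum(
        (sum(1 for xi in x if xi == v) / n)
        * entropy([yi for xi, yi in zip(x, y) if xi == v])
        for v in set(x)
    )

def mutual_info_ranking(features, y):
    h_y = entropy(y)  # computed once, shared by all features
    return sorted(
        ((name, h_y - cond_entropy(y, x)) for name, x in features.items()),
        key=lambda kv: -kv[1],  # largest mutual information first
    )
```

A feature that determines $Y$ perfectly scores $I(X;Y) = H(Y)$; an irrelevant one scores roughly zero.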

Further, what would $H(X \mid Y)$ mean here? It measures how much your target reduces uncertainty about the feature. I don’t think this is what you want, because the goal is to select features that are indicative of the target. Consider the case of a perfect feature. When X takes on the values A or B, Y is always +1. When X takes on the values C or D, Y is always -1. Here, $H(Y \mid X) = 0$. But in the other direction $H(X \mid Y)$, it’s much larger—without that size being informative.
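You can check the asymmetry numerically. Below I take $X$ uniform over A, B, C, D (an assumption I'm adding for concreteness; the argument holds regardless), with $Y$ determined by $X$ as described:

```python
from collections import Counter
from math import log2

def entropy(vals):
    """Empirical Shannon entropy (bits) of a discrete sample."""
    n = len(vals)
    return -sum((c / n) * log2(c / n) for c in Counter(vals).values())

def cond_entropy(target, given):
    """H(target | given), both passed as aligned lists of discrete values."""
    n = len(given)
    return sum(
        (sum(1 for g in given if g == v) / n)
        * entropy([t for t, g in zip(target, given) if g == v])
        for v in set(given)
    )

X = ["A", "B", "C", "D"] * 25                       # uniform over four values
Y = [+1 if x in ("A", "B") else -1 for x in X]      # perfectly predicted by X

print(cond_entropy(Y, X))  # H(Y | X) = 0.0: X removes all uncertainty about Y
print(cond_entropy(X, Y))  # H(X | Y) = 1.0 bit: Y still leaves X uncertain
```

So the "perfect" feature scores $H(Y \mid X) = 0$ but $H(X \mid Y) = 1$ bit, and that 1 bit says nothing about how useful the feature is.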

Arya McCarthy

Yes, this is known as information gain and it can be used as a feature selection method. There are implementations of this in the R packages FSelector and CORElearn. You might also consider using information gain ratio and minimum description length (both also available in CORElearn; I'm not sure about FSelector).

astel
  • [Information gain](https://stats.stackexchange.com/questions/13389/information-gain-mutual-information-and-related-measures) is $I(X; Y)$ which is already common for feature selection; that’s mentioned in the question. – Arya McCarthy Apr 06 '21 at 23:17