Expanding my comment: there's a computational reason you might not want this, and a methodological one.
Computing $H(X \mid Y)$ requires you to define the distribution $P(X \mid Y)$, even though you're typically modeling the other direction, $P(Y \mid X)$. How would you estimate this conditional entropy, especially with a discriminative model that never represents $P(X \mid Y)$ at all?
Mutual information is easier because you can instead compute it as $H(Y) - H(Y \mid X)$, i.e., compare the entropy of the target before and after conditioning on the feature. Better still, $H(Y)$ only needs to be computed once: it can be reused across all features, or skipped entirely when ranking, since it's a constant.
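To make that concrete, here's a minimal plug-in sketch for discrete features and labels (the helper names `entropy`, `conditional_entropy`, and `mutual_information` are my own, not from any particular library):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical plug-in entropy H(Y) in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """Empirical H(Y | X) = sum over x of P(X = x) * H(Y | X = x)."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for value in np.unique(x):
        mask = x == value
        h += mask.mean() * entropy(y[mask])
    return h

def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X).

    H(Y) is the same constant for every feature, so ranking features
    by I(X; Y) is equivalent to ranking them by -H(Y | X).
    """
    return entropy(y) - conditional_entropy(x, y)
```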
Further, what would $H(X \mid Y)$ mean here? It measures how much the target reduces uncertainty about the feature, which I don't think is what you want: the goal is to select features that are indicative of the target. Consider a perfect feature. When X takes the value A or B, Y is always +1; when X takes the value C or D, Y is always -1. Here $H(Y \mid X) = 0$, since the feature determines the target exactly. But in the other direction, $H(X \mid Y)$ is much larger, and that size tells you nothing about the feature's usefulness.
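Reusing the hypothetical helpers above, and additionally assuming X is uniform over {A, B, C, D}, you can check the numbers:

```python
x = ["A", "B", "C", "D"] * 25   # feature, uniform over four values
y = [+1, +1, -1, -1] * 25       # target: A/B -> +1, C/D -> -1

print(conditional_entropy(x, y))  # H(Y | X) = 0.0: the feature is perfect
print(conditional_entropy(y, x))  # H(X | Y) = 1.0 bit, despite perfection
print(mutual_information(x, y))   # I(X; Y) = H(Y) = 1.0 bit
```

The 1 bit arises only because each class hides two equally likely feature values; a perfect feature with more values per class would score even higher on $H(X \mid Y)$ without being any more useful.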