Expanding my comment: there's a computational reason you might not want this, and a methodological one.
Computing $H(X \mid Y)$ requires you to define the distribution $P(X \mid Y)$, even though you're typically modeling the other direction, $P(Y \mid X)$. How would you estimate this conditional entropy, especially with a discriminative model that never represents $P(X \mid Y)$ at all?
Mutual information is easier because you can instead compute it as $H(Y) - H(Y \mid X)$, i.e., compare the entropy of the target before and after conditioning on the feature. Better still, $H(Y)$ only needs to be computed once: it can be reused across all features, or skipped entirely when ranking, since it's a constant.
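To make that concrete, here's a minimal plug-in sketch for discrete features and labels (the helper names `entropy`, `conditional_entropy`, and `mutual_information` are my own, not from any particular library):

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Empirical plug-in entropy H(Y) in bits."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(x, y):
    """Empirical H(Y | X) = sum over x of P(X = x) * H(Y | X = x)."""
    x, y = np.asarray(x), np.asarray(y)
    h = 0.0
    for value in np.unique(x):
        mask = x == value
        h += mask.mean() * entropy(y[mask])
    return h

def mutual_information(x, y):
    """I(X; Y) = H(Y) - H(Y | X).

    H(Y) is the same constant for every feature, so ranking features
    by I(X; Y) is equivalent to ranking them by -H(Y | X).
    """
    return entropy(y) - conditional_entropy(x, y)
```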
Further, what would $H(X \mid Y)$ mean here? It measures how much the target reduces uncertainty about the feature, which I don't think is what you want: the goal is to select features that are indicative of the target. Consider a perfect feature. When X takes the value A or B, Y is always +1; when X takes the value C or D, Y is always -1. Here $H(Y \mid X) = 0$, since the feature determines the target exactly. But in the other direction, $H(X \mid Y)$ is much larger, and that size tells you nothing about the feature's usefulness.
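Reusing the hypothetical helpers above, and additionally assuming X is uniform over {A, B, C, D}, you can check the numbers:

```python
x = ["A", "B", "C", "D"] * 25   # feature, uniform over four values
y = [+1, +1, -1, -1] * 25       # target: A/B -> +1, C/D -> -1

print(conditional_entropy(x, y))  # H(Y | X) = 0.0: the feature is perfect
print(conditional_entropy(y, x))  # H(X | Y) = 1.0 bit, despite perfection
print(mutual_information(x, y))   # I(X; Y) = H(Y) = 1.0 bit
```

The 1 bit arises only because each class hides two equally likely feature values; a perfect feature with more values per class would score even higher on $H(X \mid Y)$ without being any more useful.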