
I'd like to learn a model from many features. As a pre-processing step, I will remove constant features. I'd like to remove almost constant features as well, i.e. I'd like to set a low variance filter. Would you have any guidance on how to implement this filter? My concern is that some features are around 100 (in which case a variance of 0.1 is very small) and some other features are around 0 (in which case a variance of 0.1 is not small).
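To make the scale concern concrete, here is a minimal NumPy sketch (an illustration with made-up numbers, not code from the question): two features with the same variance of about 0.1, one centered near 100 and one near 0, which a single variance threshold cannot tell apart.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Both features have variance ~0.1, but live on very different scales.
around_100 = rng.normal(loc=100.0, scale=np.sqrt(0.1), size=n)
around_0 = rng.normal(loc=0.0, scale=np.sqrt(0.1), size=n)

for name, x in [("around_100", around_100), ("around_0", around_0)]:
    print(f"{name}: mean={x.mean():.3f}, var={x.var():.3f}")
# A rule like "drop features with var < 0.2" removes both, even though the
# spread is negligible relative to the scale of the first feature only.
```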

user7064
  • As you figured out, filtering by variance may not be the smartest choice, as it depends on the scale of the variable. You should probably not do this. – user2974951 Feb 14 '22 at 07:45
  • Any other suggestion? – user7064 Feb 14 '22 at 07:46
  • Not really, see [Why is variable selection necessary?](https://stats.stackexchange.com/questions/18214/why-is-variable-selection-necessary?rq=1) and [A more definitive discussion of variable selection](https://stats.stackexchange.com/questions/223808/a-more-definitive-discussion-of-variable-selection?noredirect=1&lq=1) for some discussion. – user2974951 Feb 14 '22 at 07:49

4 Answers


The only reasonable threshold is removing features whose variance equals zero, i.e. features that are actually constant. Constant features carry no information, so they are useless.

As for what "low variance" means, as you noticed, it is subjective. You could quantify it using the coefficient of variation $c_v = \tfrac{\sigma}{\mu}$, but why would you remove features with low variance in the first place?
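A minimal NumPy sketch of both points (the function names are mine): the unambiguous zero-variance filter, plus the coefficient of variation, which is scale-free but breaks down when the mean is near zero.

```python
import numpy as np

def drop_constant_columns(X):
    """Remove columns with exactly zero variance -- the only unambiguous filter."""
    keep = X.var(axis=0) > 0
    return X[:, keep]

def coefficient_of_variation(X):
    """Per-column c_v = sigma / mu. Scale-free, but unstable when mu is near 0."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return X.std(axis=0) / X.mean(axis=0)
```

Note that `coefficient_of_variation` returns inf/nan for zero-mean columns, which is exactly the failure mode raised in the comments below.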

Imagine you work for a company that provides e-mail services. A new spammer has appeared on the market and started sending spam e-mails. You notice that all of the spam e-mails mention "viagra", while this is very rare in non-spam e-mails. A simple rule marking e-mails that mention "viagra" as spam therefore has nearly perfect classification performance, even though the binary feature indicating the word "viagra" has very low variance. It is in fact one of your best features in the dataset.
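To see this numerically (all numbers invented for illustration): a binary feature present in 1% of e-mails has variance $p(1-p) \approx 0.0099$, yet it can be an almost perfect predictor.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

mentions_viagra = rng.random(n) < 0.01               # rare word -> low-variance feature
is_spam = mentions_viagra | (rng.random(n) < 0.005)  # spam = the word, plus other spam

print("feature variance:", mentions_viagra.var())                 # ~0.0099
print("P(spam | mention):", is_spam[mentions_viagra].mean())      # 1.0
print("P(spam | no mention):", is_spam[~mentions_viagra].mean())  # ~0.005
```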

Moreover, as noted in the comment by user2974951, in most cases you don't need variable selection at all. Different feature selection algorithms often give inconsistent results. If your dataset is small, you should rather use expert judgment to pick the features that make sense, or use regularization (sketched below).
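For completeness, a small scikit-learn sketch of the regularization route (synthetic data; the choice of lasso is mine, one option among several):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=50, n_informative=5, random_state=0)

# Standardize first, so the penalty treats all features on an equal footing.
model = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)

# The lasso drives uninformative coefficients to exactly zero,
# doing selection and estimation in a single step.
print("nonzero coefficients:", (model[-1].coef_ != 0).sum(), "of", X.shape[1])
```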

Tim
  • Thank you for your reply. I have discarded the CV option because some features have zero mean. – user7064 Feb 14 '22 at 08:27
  • @user7064 that's one of the problems with $c_v$, but my answer is that you should not discard features with "low" variance because it doesn't make sense. – Tim Feb 14 '22 at 08:28

Some of the trouble with this is that variance has units, even though we often just look at the number. With the right choice of units, you can make that number as small as you want. For instance, if you are measuring how wide DNA is, measure in light years ($\approx10^{13}$ km): all of the DNA width measurements become really small numbers with a small variance. If you measure in nanometers instead, you wind up with a much larger numerical value of variance, even though it describes exactly the same spread as the variance measured in light years (really squared light years, since variance carries squared units).
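A quick numeric check of the units point (the widths are made up; real DNA is roughly 2 nm across):

```python
import numpy as np

NM_PER_LIGHT_YEAR = 9.4607e24  # 1 light year ~ 9.46e12 km ~ 9.46e24 nm

widths_nm = np.array([1.8, 1.9, 2.0, 2.1, 2.2])  # DNA widths in nanometers
widths_ly = widths_nm / NM_PER_LIGHT_YEAR        # the same widths in light years

print(widths_nm.var())                         # 0.02 (nm^2)
print(widths_ly.var())                         # ~2.2e-52 (ly^2)
print(widths_ly.var() * NM_PER_LIGHT_YEAR**2)  # 0.02 again: same spread, different units
```

Any fixed numeric threshold on variance therefore encodes an arbitrary choice of units.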

However, if you suspect that DNA width influences your outcome variable of interest, then you've discarded a pertinent variable.

Even if you measure in reasonable units and wind up with a tiny value for variance, it could be that small changes in that variable result in large changes in the outcome. This reminds me of a Jerry Seinfeld routine about not liking to stay with friends when he travels.

> I don't like using other people's showers. I don't know the ratio on the dial. It could be that a sixteenth of an inch is a thousand degrees!

A sixteenth of an inch might sound like a small distance, but the result is that the water gets a great deal hotter. In other words, that small change in the position of the shower dial has a huge impact on the temperature.

Dave

Low-variance features can still have a huge impact. There are many methods that try to estimate feature importance; a popular example is the use of random forests, as sketched below.
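A rough scikit-learn sketch of that idea (synthetic data; the specific parameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, n_informative=4, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; permutation importance is a common alternative.
for idx in forest.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {forest.feature_importances_[idx]:.3f}")
```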

frank

You can use the near-zero variance filter from the recipes package, step_nzv(); see the Details section of its documentation.
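recipes is an R package; for readers outside R, here is a rough Python translation of the near-zero-variance heuristic that step_nzv() documents (the freq_cut = 95/5 and unique_cut = 10 defaults follow that documentation; treat the details as an assumption):

```python
import numpy as np

def near_zero_variance(x, freq_cut=95 / 5, unique_cut=10.0):
    """Flag a column as near-zero variance when its most common value heavily
    dominates the runner-up AND it has few distinct values overall."""
    values, counts = np.unique(x, return_counts=True)
    if len(values) == 1:
        return True  # exactly constant
    top, runner_up = np.sort(counts)[::-1][:2]
    pct_unique = 100.0 * len(values) / len(x)
    return (top / runner_up > freq_cut) and (pct_unique < unique_cut)
```

Unlike a raw variance threshold, this heuristic depends only on value frequencies, so it is unaffected by the scale of the feature.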

dzegpi