
Was discussing with a friend: suppose we have one model that uses 1,000 features and another that uses 100,000 features. Assuming the first 1,000 features are the same in both, shouldn't the one with 100,000 features always do at least as well as the 1,000-feature model?

I say this because, if there's a correlation between an additional feature and the target variable, the model can learn it. If there's no correlation, the model should learn to ignore that feature. So a model with more features should always be at least as good as one trained on only a subset of the same features.

My friend claims features can actively hamper model performance so that more isn't always better...how is this possible?

Thanks!

anon_swe
    Your friend is right. Irrelevant features allow models to mistake noise for signal. Read about bias and variance. – Matthew Drury Jul 12 '17 at 23:05
  • Have you considered the [p>>n problem or the curse of dimensionality?](https://stats.stackexchange.com/questions/10423/number-of-features-vs-number-of-observations) – julian Jul 13 '17 at 01:32

2 Answers


It depends on what you want your model to achieve. Are these features necessary to your model? For example, gene expression datasets often have on the order of 10,000 features (one per gene), but all of them are usually needed to identify significant genetic pathways. If the extra features are not helpful, then a smaller feature set that gives similar accuracy to a model with a large feature set is always preferable, because you get classification/regression results faster.

Too many features are often a bad thing. They can lead to overfitting, where the model fits your training data too closely and performs poorly on other datasets. Another thing to note is that a larger feature set usually requires a larger sample size; otherwise you will also have to rely heavily on regularization.
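As a quick illustration, here is a minimal sketch (assuming scikit-learn and NumPy are available; the feature counts, the random-forest model, and names like `n_noise` are my own illustrative choices, not anything from the answer). Padding a small informative feature set with pure-noise columns typically leaves training accuracy high while test accuracy drops:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# A small dataset whose 10 features are all genuinely informative.
X, y = make_classification(n_samples=200, n_features=10, n_informative=10,
                           n_redundant=0, random_state=0)

# The same data padded with 1,000 columns of pure noise.
n_noise = 1000
X_padded = np.hstack([X, rng.normal(size=(X.shape[0], n_noise))])

for name, features in [("10 informative features", X),
                       ("plus 1,000 noise features", X_padded)]:
    X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.5,
                                              random_state=0)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_tr, y_tr)
    print(f"{name}: train acc = {accuracy_score(y_tr, model.predict(X_tr)):.2f}, "
          f"test acc = {accuracy_score(y_te, model.predict(X_te)):.2f}")
```

In a setup like this, training accuracy stays near 1 in both cases, but the test accuracy of the padded model is usually noticeably lower: the noise columns give the model more ways to memorize the training sample.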

That is why there are so many research papers on feature selection and dimensionality reduction.

rmehta1987

For any single feature, the probability of a spurious feature-target correlation in the training set is small but nonzero; that is one reason we use a test set. The probability of a spurious correlation that also holds in the test set is smaller still. However, each new feature is another hypothesis being tested, so with more features you increase the risk that this occurs. The risk is higher with a small number of observations and lower with more.
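To make that concrete, here is a small sketch (assuming only NumPy; the sample sizes, feature counts, and the helper `corr_with` are illustrative, not part of the answer). Every feature is pure noise, yet the strongest training-set correlation grows with the number of features, while that same "best" feature shows essentially no correlation on fresh data:

```python
import numpy as np

def corr_with(X, y):
    """Pearson correlation of each column of X with the vector y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return Xc.T @ yc / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

rng = np.random.default_rng(0)
n_train, n_test = 50, 50
y_train = rng.normal(size=n_train)   # the target is pure noise as well
y_test = rng.normal(size=n_test)

for n_features in (10, 1_000, 100_000):
    X_train = rng.normal(size=(n_train, n_features))   # all columns are noise
    X_test = rng.normal(size=(n_test, n_features))

    train_corr = corr_with(X_train, y_train)
    best = int(np.argmax(np.abs(train_corr)))   # most "promising" spurious feature

    print(f"{n_features:>7} features: best train corr = {train_corr[best]:+.2f}, "
          f"same feature on test data = {corr_with(X_test, y_test)[best]:+.2f}")
```

The best-looking spurious correlation gets stronger as you add features, but it evaporates on held-out data, which is exactly the multiple-hypothesis risk described above.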

But say your friend's model has the first 1,000 features, yours has all 100,000, and you're trying to predict housing prices. It may be that location is the 1,001$^{st}$ feature. There is something to be said for restricting the variance of the model while keeping all of the features, e.g. with shrinkage, dropout, or ensembles.
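As a hedged sketch of "restrict the variance while keeping all of the features" (assuming scikit-learn; the feature counts and the ridge penalty are arbitrary choices for illustration): with more columns than rows, plain least squares interpolates the training data and generalizes poorly in cross-validation, while ridge shrinkage keeps every feature but typically does much better.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, n_informative, n_noise = 200, 5, 200

# 5 informative columns plus 200 pure-noise columns (205 features, 200 rows).
X_info = rng.normal(size=(n, n_informative))
X = np.hstack([X_info, rng.normal(size=(n, n_noise))])
y = X_info @ rng.normal(size=n_informative) + 0.5 * rng.normal(size=n)

for name, model in [("OLS, all 205 features", LinearRegression()),
                    ("Ridge(alpha=10), all 205 features", Ridge(alpha=10.0))]:
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {r2:.2f}")
```

Both models see every feature, including location in the housing analogy; shrinkage just keeps the noise columns from blowing up the variance.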

I think that the answer to your question is that more information is always better, but that it comes with a risk.

sjw