
I am new to machine learning.

I have a basic question about feature selection. I have a dataset with 100 features which I use to regress an output variable. When I do regression with all the features, I get a particular regression error, r1. When I do feature selection (using step-forward feature selection) and select X (fewer than 100) features, I get a lower regression error, r2 (r2 << r1).
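For reference, here is roughly how the comparison can be set up (the synthetic data from make_regression just stands in for my real dataset, and the choice of plain linear regression and of 20 selected features is only a placeholder):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the real 100-feature dataset.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
model = LinearRegression()

# Cross-validated error with all 100 features (r1).
r1 = -cross_val_score(model, X, y, cv=5,
                      scoring="neg_mean_squared_error").mean()

# Step-forward selection, then the error with only the selected features (r2).
sfs = SequentialFeatureSelector(model, n_features_to_select=20,
                                direction="forward", cv=5)
X_sel = sfs.fit_transform(X, y)
r2 = -cross_val_score(model, X_sel, y, cv=5,
                      scoring="neg_mean_squared_error").mean()

print(f"r1 (all features)     : {r1:.2f}")
print(f"r2 (selected features): {r2:.2f}")
```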

I am trying to understand why the machine learning algorithm does worse with more features. Isn't the performance of the algorithm supposed to increase (or at least stay the same) when we add new features? Does this mean that the algorithm is not a good choice for the problem, or that I don't have enough data for the algorithm to learn?

Can you please help me?

Richard Hardy

1 Answer


The general rule is: the more features you have, the more likely it is that your model overfits; the more data (samples) you have, the less likely it is that your model overfits.
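As a small illustration of this rule (all numbers below are made up for demonstration): with about as many features as training samples and only a handful of informative features, a plain linear regression fits the training data almost perfectly but generalizes poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 150, 100, 5

# Only the first few features actually influence the target.
X = rng.normal(size=(n_samples, n_features))
coef = np.zeros(n_features)
coef[:n_informative] = rng.normal(size=n_informative)
y = X @ coef + rng.normal(scale=0.5, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Training error is near zero, test error is much larger: the model has
# fitted noise in the many uninformative features, i.e. it overfits.
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test MSE: ", mean_squared_error(y_te, model.predict(X_te)))
```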

Of course, it also depends on the "quality" of your features and data. For example, if you simply have more samples, but those samples don't represent the distribution of the data you will later apply the model to, the additional samples might not improve your model's accuracy.

And of course the model's own propensity to overfit also plays a role.

To know whether your model overfits, look at the score (such as MAE or MSE) your model reaches on the training data and on the validation/test data. For example, if it is nearly 100% accurate on the training data but only moderately accurate on the validation data, you can assume that it overfits and that removing features could improve the situation (though you should probably try regularization measures first).
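A rough sketch of this check (the synthetic data stands in for your dataset; the models, the Ridge alpha and the 25% validation split are just illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=150, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

for name, model in [("plain OLS", LinearRegression()),
                    ("ridge (regularized)", Ridge(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # A training error far below the validation error points to overfitting;
    # regularization (or dropping features) typically narrows the gap.
    print(f"{name}: train MSE = {train_mse:.1f}, validation MSE = {val_mse:.1f}")
```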

jottbe