
I am new to machine learning.

I have a basic question about feature selection. I have a dataset with 100 features which I use to regress an output variable. When I do regression with all the features, I get a particular regression error, r1. When I do feature selection (using step-forward feature selection) and select X (fewer than 100) features, I get a lower regression error, r2 (r2 << r1).
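For reference, here is roughly how the comparison can be set up (the synthetic data from make_regression just stands in for my real dataset, and the choice of plain linear regression and of 20 selected features is only a placeholder):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the real 100-feature dataset.
X, y = make_regression(n_samples=200, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
model = LinearRegression()

# Cross-validated error with all 100 features (r1).
r1 = -cross_val_score(model, X, y, cv=5,
                      scoring="neg_mean_squared_error").mean()

# Step-forward selection, then the error with only the selected features (r2).
sfs = SequentialFeatureSelector(model, n_features_to_select=20,
                                direction="forward", cv=5)
X_sel = sfs.fit_transform(X, y)
r2 = -cross_val_score(model, X_sel, y, cv=5,
                      scoring="neg_mean_squared_error").mean()

print(f"r1 (all features)     : {r1:.2f}")
print(f"r2 (selected features): {r2:.2f}")
```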

I am trying to understand why the machine learning algorithm does worse with more features. Isn't the performance of the algorithm supposed to increase (or at least stay the same) when we add new features? Does this mean that the algorithm is not a good choice for the problem, or that I don't have enough data for the algorithm to learn?

Can you please help me?

Richard Hardy

1 Answer


The general rule is: the more features you have, the more likely it is that your model overfits; the more data (samples) you have, the less likely it is that your model overfits.
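As a small illustration of this rule (all numbers below are made up for demonstration): with about as many features as training samples and only a handful of informative features, a plain linear regression fits the training data almost perfectly but generalizes poorly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_features, n_informative = 150, 100, 5

# Only the first few features actually influence the target.
X = rng.normal(size=(n_samples, n_features))
coef = np.zeros(n_features)
coef[:n_informative] = rng.normal(size=n_informative)
y = X @ coef + rng.normal(scale=0.5, size=n_samples)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)

# Training error is near zero, test error is much larger: the model has
# fitted noise in the many uninformative features, i.e. it overfits.
print("train MSE:", mean_squared_error(y_tr, model.predict(X_tr)))
print("test MSE: ", mean_squared_error(y_te, model.predict(X_te)))
```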

Of course, it also depends on the "quality" of your features and data. For example, if you simply have more samples, but those samples don't represent the distribution of the data you will later apply the model to, the additional samples might not improve your model's accuracy.

And of course the model's own propensity to overfit also plays a role.

To know whether your model overfits, look at the score (such as MAE or MSE) your model reaches on the training data and on the validation/test data. For example, if it is nearly 100% accurate on the training data but only moderately accurate on the validation data, you can assume that it overfits and that removing features could improve the situation (though you should probably try regularization measures first).
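A rough sketch of this check (the synthetic data stands in for your dataset; the models, the Ridge alpha and the 25% validation split are just illustrative choices):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=150, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25,
                                            random_state=0)

for name, model in [("plain OLS", LinearRegression()),
                    ("ridge (regularized)", Ridge(alpha=1.0))]:
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    # A training error far below the validation error points to overfitting;
    # regularization (or dropping features) typically narrows the gap.
    print(f"{name}: train MSE = {train_mse:.1f}, validation MSE = {val_mse:.1f}")
```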

jottbe