
I am working on binary classification with a class proportion of 77:23 (977 records).

Currently, I am exploring feature selection approaches and came across methods like the ones below:

a) Featurewiz

b) Sequential forward and backward feature selection

c) BorutaPy

d) RFE, etc.

Now, all the above methods use an ML model to find the best-performing features.

Now my questions are:

a) Do we have to use the best parameters for getting the best features?

b) If yes, then once we select the features, do we have to run GridSearchCV again and find the best parameters to fit and predict?

Or do you think it suffices to just use default parameters for feature selection, and then use the best parameters for model building?

The Great
  • I would agree with gunes. It is kind of a chicken-and-egg problem, and you won't find the global optimum. However, that's okay - you can still get good results. I would recommend using something like Boruta as a feature selection tool, then optimizing hyperparams with GridSearchCV afterwards, using only the good features (as selected by Boruta); see the sketch after these comments. – Vladimir Belik Feb 22 '22 at 18:18
  • Based on both of your responses, I have another related question, digging deeper to understand the "why" of certain decisions during ML model building. Would you be interested to share your views on this - https://stats.stackexchange.com/questions/565454/why-is-ml-called-an-empirical-field – The Great Feb 23 '22 at 01:59
  • To be honest, I couldn't quite grasp your question there. Could you try rephrasing? I'm not sure what you mean by "random", for example. – Vladimir Belik Feb 23 '22 at 15:17
  • @VladimirBelik - by random, I mean not consistent. And why we are not able to find consistent, specific reasoning/explanations for why an ML model does what it does. – The Great Feb 24 '22 at 00:22
  • If it's still not clear, the fault might be with my English skills. I'm not sure how else to phrase it. – The Great Feb 24 '22 at 00:23
  • @VladimirBelik - I remember reading somewhere online that ML is still an empirical field and that not everything can be proved by evidence/theory. This is what I am trying to get at. Unfortunately, I don't have the source of that info, but that idea struck a chord with me. If you can share your thoughts on that, it would really be useful and I will be grateful – The Great Feb 24 '22 at 00:28
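A minimal sketch of the workflow suggested in Vladimir Belik's comment, assuming scikit-learn and the boruta package: Boruta with a reasonable (not tuned) random forest selects the features, and GridSearchCV then tunes the hyperparameters on the confirmed subset only. The synthetic data, forest settings, and parameter grid are illustrative placeholders, not values from the question.

```python
# Sketch: feature selection with BorutaPy, then hyperparameter tuning on the
# selected features. Data and parameter values are placeholders.
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 977-record, 77:23 data set in the question
X, y = make_classification(n_samples=977, n_features=30, weights=[0.77], random_state=0)

# Feature selection with a "good enough" (default-ish) estimator
rf = RandomForestClassifier(n_jobs=-1, class_weight="balanced", max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(X, y)                    # BorutaPy expects numpy arrays, not DataFrames
X_sel = X[:, boruta.support_]       # keep only the "confirmed" features

# Hyperparameter search on the reduced feature set
param_grid = {"n_estimators": [200, 500], "max_depth": [3, 5, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=5, scoring="roc_auc")
search.fit(X_sel, y)
print(search.best_params_, search.best_score_)
```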

1 Answer


Both feature selection and hyper-parameter (HP) optimization are sub-optimal when done separately. With infinite compute power, we could have done both at the same time; but we can't search the whole space, so we have to rely on approximate approaches.

Do we have to use the best parameters for getting the best features?

Typical practice is to use a good enough estimator. Usually, the best HPs found with the complete feature set may not be the same as the ones found with a feature subset; it's a chicken-and-egg problem. So no, you don't have to. These are all approximate approaches.

You can also take the feature sets found by the above heuristics and include them in your HP search, e.g. include your best three feature sets as an extra search dimension and find the best HPs together with the feature set.
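As a concrete illustration of this, here is a minimal sketch assuming scikit-learn: the candidate feature sets are treated as just another grid dimension. The ColumnSubset helper, the synthetic data, and the three index lists are hypothetical stand-ins for the subsets returned by Boruta, RFE, or sequential selection.

```python
# Sketch: search over candidate feature sets and classifier HPs jointly.
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


class ColumnSubset(BaseEstimator, TransformerMixin):
    """Keep only the columns listed in `cols` (a small custom helper, not a sklearn class)."""

    def __init__(self, cols=None):
        self.cols = cols

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[:, self.cols] if self.cols is not None else X


# Synthetic stand-in for the data set in the question
X, y = make_classification(n_samples=977, n_features=30, weights=[0.77], random_state=0)

# Hypothetical feature subsets produced by the selection heuristics above
candidate_sets = [[0, 3, 5, 8], [0, 3, 5, 8, 12, 17], list(range(30))]

pipe = Pipeline([("subset", ColumnSubset()),
                 ("clf", RandomForestClassifier(random_state=0))])

param_grid = {
    "subset__cols": candidate_sets,        # the feature set acts as a "hyperparameter"
    "clf__n_estimators": [200, 500],
    "clf__max_depth": [3, 5, None],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X, y)
print(search.best_params_["subset__cols"], search.best_score_)
```

Because the feature sets and the HPs are searched inside the same cross-validation, each combination is scored consistently, which is about as close to "doing both at the same time" as the compute budget allows.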

gunes