I've been reading about feature selection and hyperparameter tuning, but I'm getting lost on how to properly code/set up the experiment. I'm working on a classification ML experiment with 1200 samples and 400 features, and I would like to optimize my models. My plan is to do a stratified k-fold analysis, use RFE for feature selection, and do hyperparameter tuning for models where applicable. My understanding is that both the feature selection and the hyperparameter tuning should happen inside each fold of the cross-validation loop? I was wondering how that would be done in Python. My instinct is that I need some combination of RFE (or RFECV) and GridSearchCV.
Does this thought process make sense?
- Split the data into a training and a test set; set the test set aside for now.
- On the training set, use GridSearchCV with stratified K-fold cross-validation, and embed RFE within the loop
- Select the best model
- Evaluate that best model on the held-out test set (rough sketch of what I mean below)
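If it helps, here is roughly what I imagine option A looking like. This is only a minimal sketch of my own guess, assuming made-up data of the same shape (`make_classification`), a placeholder linear SVC, 5 folds, a 20% test split, and arbitrary grid values; none of those choices are set in stone. My understanding is that because RFE sits inside the Pipeline, GridSearchCV refits it on each fold's training split, which is what I mean by doing feature selection inside the loop, but please correct me if I've wired it up wrong:

```python
# Option A sketch (assumptions: placeholder data, placeholder linear SVC, 5 folds)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Placeholder data with the same shape as my real dataset (1200 x 400)
X, y = make_classification(n_samples=1200, n_features=400, random_state=0)

# Step 1: hold out a stratified test set and ignore it until the very end
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 2: pipeline = RFE + classifier; GridSearchCV runs stratified K-fold CV
# and refits RFE on each fold's training split, so no fold's validation data leaks in.
pipe = Pipeline([
    ("rfe", RFE(estimator=SVC(kernel="linear"))),
    ("clf", SVC(kernel="linear")),
])
param_grid = {
    "rfe__n_features_to_select": [20, 50, 100],  # number of features tuned as a hyperparameter
    "clf__C": [0.1, 1, 10],
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy", n_jobs=-1)
search.fit(X_train, y_train)

# Steps 3-4: best pipeline (refit on the full training set) evaluated once on the test set
print(search.best_params_)
print(search.score(X_test, y_test))
```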
OR
- Split the data into training/test sets
- Run K-fold RFE (RFECV) on the training set for a given model
- Keep the features identified by RFE
- Then perform hyperparameter tuning using only those features (sketched below)
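And here is roughly how I picture option B, with the same placeholder data/estimator assumptions as above. The part that worries me is that the feature selection only runs once on the whole training set here, rather than inside every tuning fold like in option A:

```python
# Option B sketch (same assumptions: placeholder data, placeholder linear SVC, 5 folds)
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.feature_selection import RFECV
from sklearn.svm import SVC

X, y = make_classification(n_samples=1200, n_features=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step 2: stratified K-fold RFE (RFECV) on the training set to choose the feature subset
selector = RFECV(estimator=SVC(kernel="linear"), step=10, cv=cv, scoring="accuracy")
selector.fit(X_train, y_train)
X_train_sel = selector.transform(X_train)
X_test_sel = selector.transform(X_test)

# Step 3: hyperparameter tuning restricted to the selected features only
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}, cv=cv, scoring="accuracy")
search.fit(X_train_sel, y_train)

print(selector.n_features_, search.best_params_)
print(search.score(X_test_sel, y_test))
```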
Does this make sense? Could someone provide some example code so I can see it laid out?
Thanks!