
I'm new to Machine Learning. I'm basically confused about when to perform the train/test split.

Is the order given below correct?

  1. Split the entire data set into training and test sets

  2. Extract features from the training data

  3. Fit the classification model to the features extracted from the training data

  4. Extract the same features, as computed in step 2, from the test data

  5. Apply the model fitted in step 3 to the features extracted from the test data in step 4, in order to evaluate the model
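
To make the steps concrete, here is a minimal sketch of what I mean, assuming scikit-learn (the `StandardScaler` and `LogisticRegression` are just placeholder choices for the feature extractor and classifier):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data as a stand-in for the real problem
X, y = make_classification(n_samples=500, random_state=0)

# Step 1: split the raw data first, before computing any features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 2: fit the feature extractor on the training data only
scaler = StandardScaler()
X_train_feats = scaler.fit_transform(X_train)

# Step 3: fit the classifier to the training features
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_feats, y_train)

# Step 4: apply the already-fitted extractor to the test data
# (transform only -- never fit anything on the test set)
X_test_feats = scaler.transform(X_test)

# Step 5: evaluate the fitted model on the test features
y_pred = clf.predict(X_test_feats)
print("Test accuracy:", accuracy_score(y_test, y_pred))
```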

Debbie
  • Yes, your procedure is accurate. It's important to keep in mind that the validation/test set simulates samples that you don't have at the moment of model fitting. For this reason, you never use any information from these samples. – osmoc Aug 18 '20 at 14:28
  • @ping, I asked a question here: https://stats.stackexchange.com/questions/484292/machine-learning-hyperparameter-tuning-data-leakage-is-my-procedure-free-o Could you please answer it? – Debbie Aug 24 '20 at 22:23

1 Answer


Your procedure is generally correct. In a more complex pipeline, additional steps may include validation, hyper-parameter optimisation, feature selection, etc.

Typically, feature extraction follows exploratory data analysis (EDA), where you get to know your data, analyse/summarise it, and draw intuitive conclusions. In EDA, you don't necessarily do a train/test split.

Note that if you repeat steps 2-3 in a feedback loop, testing whether newly extracted features (e.g. interaction variables) improve the model, you'll need a validation step; see the sketch below.
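
As a rough sketch of what that validation step could look like, again assuming scikit-learn (the interaction feature, logistic regression, and 5-fold cross-validation below are illustrative choices, not part of the original procedure): candidate feature sets are compared with cross-validation inside the training set, so the test set stays untouched until the final evaluation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Toy data as a stand-in for the real problem
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Candidate feature: an interaction term (product of the first two columns)
X_train_plus = np.hstack([X_train, X_train[:, [0]] * X_train[:, [1]]])

clf = LogisticRegression(max_iter=1000)

# Compare feature sets with 5-fold cross-validation *within the training
# set*; the test set is never touched during this feedback loop
base_score = cross_val_score(clf, X_train, y_train, cv=5).mean()
plus_score = cross_val_score(clf, X_train_plus, y_train, cv=5).mean()
print(f"CV accuracy without interaction: {base_score:.3f}")
print(f"CV accuracy with interaction:    {plus_score:.3f}")
```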

gunes
  • It's always good practice to do some sort of validation in EDA as well, in order to reduce the chances of overfitting. – osmoc Aug 18 '20 at 14:26
  • Yeah (+1), that also depends on the depth of EDA. That's why I said *not necessarily*. – gunes Aug 18 '20 at 14:27
  • Agree. EDA is a damn beast (sorry for the technical language)! :-D – osmoc Aug 18 '20 at 14:28
  • If you do EDA before splitting rather than on the training data only, and you draw any conclusions from it that influence your later classification model, the later evaluation of the model on the test data becomes invalid, because decisions have been made that depend on the test data as well. (This is probably what ping hinted at already.) In many situations this may not make a big difference, but it can be a problem. – Christian Hennig Aug 18 '20 at 14:30
  • This question and answer assume that there is a very large data set to start with, on the order of several thousands of samples. There's a danger in applying this approach to smaller data sets; see [this answer](https://stats.stackexchange.com/a/54921/28500) for example. Internal validation might be preferred with smaller data sets. – EdM Aug 18 '20 at 16:31
  • @Lewian, I asked a question here: https://stats.stackexchange.com/questions/484292/machine-learning-hyperparameter-tuning-data-leakage-is-my-procedure-free-o Could you please answer it? – Debbie Aug 24 '20 at 22:24
  • @gunes: I asked a question here: https://stats.stackexchange.com/questions/484292/machine-learning-hyperparameter-tuning-data-leakage-is-my-procedure-free-o Could you please answer it? – Debbie Aug 24 '20 at 22:25