
I am teaching myself machine learning right now, and I am confused about which preprocessing steps I should do first.

  1. Should I impute missing values before encoding the categorical variables?
  2. I am learning from Kaggle, and the notebooks there always split the data into train and test sets before doing any feature engineering. What is the reason behind this? Can I do the feature engineering on the entire dataset instead?
  3. When should I perform cross-validation? Before splitting the data?

I would also like to know the reasoning behind each decision, because I don't want to just memorize the steps. It is difficult to learn such a complex topic on my own. Thank you in advance!

  • Similar Qs with As: https://stats.stackexchange.com/questions/499228/what-is-the-correct-order-in-a-machine-learning-model-pipeline, https://stats.stackexchange.com/questions/95083/imputation-before-or-after-splitting-into-train-and-test, https://stats.stackexchange.com/questions/440372/feature-selection-before-or-after-encoding – kjetil b halvorsen Jul 20 '21 at 01:41

1 Answer


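Most of the time, imputation of missing values is applied to numeric features, while encoding applies to categorical features, so the two steps often do not interact at all. Where they do overlap, that is, when a categorical column itself has missing values, dealing with the missing values before encoding is a sensible choice.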
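To make the ordering concrete, here is a minimal sketch in plain Python rather than a definitive recipe. The toy rows, the column meanings (`age`, `city`), and the helper names `fit_preprocessing` and `transform` are all made up for illustration. It imputes first (mean for the numeric column, most frequent value for the categorical one) and encodes second, and it learns the fill values and the category list from the training rows only, which is the reason for splitting before any feature engineering.

```python
from statistics import mean

# Hypothetical toy data: (age, city); None marks a missing value.
train = [(25.0, "Paris"), (None, "Tokyo"), (35.0, None), (30.0, "Paris")]
test = [(None, "Tokyo"), (40.0, "Lima")]


def fit_preprocessing(rows):
    """Learn fill values and the category list from the training rows only."""
    ages = [age for age, _ in rows if age is not None]
    cities = [city for _, city in rows if city is not None]
    categories = sorted(set(cities))
    age_fill = mean(ages)                          # numeric column: training mean
    city_fill = max(categories, key=cities.count)  # categorical column: training mode
    return age_fill, city_fill, categories


def transform(rows, age_fill, city_fill, categories):
    """Impute first, then one-hot encode the now-complete categorical column."""
    encoded = []
    for age, city in rows:
        # Step 1: imputation, using values learned from the training set.
        age = age_fill if age is None else age
        city = city_fill if city is None else city
        # Step 2: encoding; a city never seen in training becomes all zeros.
        one_hot = [1 if city == c else 0 for c in categories]
        encoded.append([age] + one_hot)
    return encoded


params = fit_preprocessing(train)            # fit on the training split only
train_encoded = transform(train, *params)
test_encoded = transform(test, *params)      # reuse training statistics on test
```

If you use scikit-learn, the same idea is usually expressed by putting `SimpleImputer` and `OneHotEncoder` inside a `Pipeline`, so that each cross-validation fold refits the preprocessing on its own training portion.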