When should we split the data into train, validation and test datasets?

Asked Aug 30 '20 at 05:19

Active Aug 30 '20 at 09:12

In EDA we get insights about the data. That part is clear, but what I cannot understand is which dataset we should do the EDA on.

Should we do the EDA on the train dataset, on the train + validation datasets, or on the complete dataset?

I have seen several similar questions, and their answers conflict: Question1, Question2, and "Should exploratory data analysis include validation set?"

More precisely, can anyone give a better explanation than the answers to the questions above of why I should or shouldn't use the validation dataset in the EDA?

edited Aug 30 '20 at 09:12

Anonymous

Both answers complement each other: you should do feature selection etc. on the training set, but for the very first steps (visualizing your data, getting summary statistics, etc.) you should look at the whole dataset. – Itamar Mushkin Aug 30 '20 at 06:10
Welcome to Cross Validated! Please don't keep re-posting the same question. If there's an aspect to yours that you don't think is addressed in the one indicated as a duplicate, then edit yours to focus on that. – Scortchi - Reinstate Monica Aug 30 '20 at 10:55
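
The workflow suggested in the first comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper names (`split_indices`, `pearson`), the 60/20/20 split and the 0.3 correlation threshold are all hypothetical choices, not something taken from the linked questions. It takes a first look at the whole dataset using summary statistics only, then splits, then does a target-dependent step (a simple correlation filter standing in for feature selection) on the training rows alone.

```python
import random
import statistics


def split_indices(n, valid_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle row indices and cut them into train/valid/test parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_valid = int(n * valid_frac)
    test = idx[:n_test]
    valid = idx[n_test:n_test + n_valid]
    train = idx[n_test + n_valid:]
    return train, valid, test


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Synthetic data (assumption for illustration): x1 drives y, x2 is pure noise.
rng = random.Random(42)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * a + rng.gauss(0, 0.5) for a in x1]
features = {"x1": x1, "x2": x2}

# Step 1: first look at the WHOLE dataset, summary statistics only.
summary = {
    name: (statistics.fmean(col), statistics.stdev(col))
    for name, col in features.items()
}

# Step 2: split before any decision that depends on the target.
train, valid, test = split_indices(n)

# Step 3: feature selection uses the TRAINING rows only.
y_train = [y[i] for i in train]
scores = {
    name: abs(pearson([col[i] for i in train], y_train))
    for name, col in features.items()
}
selected = [name for name, score in scores.items() if score > 0.3]

print(summary)
print(selected)
```

The point of the ordering is that nothing computed in step 3 ever sees the validation or test rows, so scores measured on those rows later are not influenced by the selection step.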

0 Answers