I split the full data into training and test sets in an 80:20 ratio. Within the training set, I then randomly carved out 10% and called it the dev (development) set. On the dev set, I select features and run 5-fold cross-validation to find the optimal hyperparameters for each ML algorithm. Once these steps are done, I apply the selected features and tuned hyperparameters to train the models on the full training data; the trained models are then used to predict on, and be evaluated against, the test set.
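The split described above can be sketched at the level of row indices. This is only a minimal illustration, assuming 1,000 rows and a fixed random seed; the function name `split_indices` and its parameters are hypothetical and not taken from any library. It shows that the dev set is a subset of the training set while the test set stays disjoint from both.

```python
import random

def split_indices(n_rows, test_frac=0.2, dev_frac=0.1, n_folds=5, seed=0):
    """Return (train, test, dev, folds) index lists for the workflow above."""
    rng = random.Random(seed)
    idx = list(range(n_rows))
    rng.shuffle(idx)

    # 80:20 split into mutually exclusive training and test sets.
    n_test = int(round(n_rows * test_frac))
    test = idx[:n_test]
    train = idx[n_test:]

    # The dev set is a random 10% sample drawn from the training set.
    # It is NOT removed from the training set, so dev is a subset of train.
    n_dev = int(round(len(train) * dev_frac))
    dev = rng.sample(train, n_dev)

    # 5-fold cross-validation inside the dev set: each fold serves once
    # as the validation part while the other four are used for fitting.
    folds = [dev[i::n_folds] for i in range(n_folds)]
    return train, test, dev, folds

train, test, dev, folds = split_indices(1000)
print(len(train), len(test), len(dev), [len(f) for f in folds])
```

With these assumed sizes, feature selection and hyperparameter tuning would see only the 80 dev rows, the final fit would use all 800 training rows, and the 200 test rows would be touched only for the final evaluation.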

Is it appropriate to derive the dev set from the training set? Or do the dev, training, and test sets have to be mutually exclusive?

Update: The suggested link, "Should final (production ready) model be trained on complete data or just on training set?", discusses a completely different matter, so my original question is not a duplicate.

KubiK888
  • In a nutshell - you can do that, as long as you keep your test set (on which performance is evaluated) separate, you can re-train including the dev set. There's some interesting discussion here: https://stats.stackexchange.com/questions/184095/should-final-production-ready-model-be-trained-on-complete-data-or-just-on-tra – Itamar Mushkin Dec 04 '19 at 07:17
  • (mutually exclusive) training, validation and holdout sets are known and used in the community, so you seem to be on the right track. – runr Dec 04 '19 at 12:51
  • To clarify. I am using mutually exclusive training and test sets. But the development set is derived from the training set. I would like to confirm if this is ok. – KubiK888 Dec 04 '19 at 14:21
  • If you use cross validation on your dev dataset to train and validate (i.e. select hyperparameters of) your models, what do you use the remaining 90% of your train data for? – Sammy Dec 04 '19 at 15:06
  • Within the 100% training set, 10% was randomly selected to create the dev set. The dev set was used to select the final feature sets and select the final hyperparameter values of the ML algorithms via cross-validation only. That means out of all the possibilities of feature combinations and hyperparameter values, I narrowed them down to those with the best performance within the dev set... – KubiK888 Dec 04 '19 at 15:20
  • (Cont'd) Once I have these final sets of features and hyperparameters, I then use the full 100% training set to train the ML model, and then use the trained model to predict on, and be evaluated against, the test set. – KubiK888 Dec 04 '19 at 15:21

0 Answers