1

I have a dataset of roughly 1000 subjects' data and their labels. The ultimate goal is to build a neural network that can classify the data as either 0 or 1. My current strategy is 10-fold CV: I hold out a different 10% of the data for testing while training on the remaining 90%, repeating this for each fold. There are many hyperparameters (hidden layer size, net type, etc.) which I am optimizing based on the average accuracy across the 10 folds. At the end of this, I've obtained the hyperparameters that give the highest average accuracy.

When it is time to actually use this classifier on new data, which model do I use? If my training set is now all 1000 data points, then my previously optimized hyperparameters are no longer optimal for that training set.

I may have a fundamental misunderstanding of k-fold, but if not, how do I go about doing this (training a classifier on labeled data so it can be used on unlabeled data in the future)?

a13a22
  • 163
  • 12

2 Answers

2

It sounds like your concern is that you've tuned your hyperparameters for a training set of size 900, but you're eventually going to be using a training set of size 1000 when you retrain a single model on all your data. That's the way it goes. Yes, perhaps you could do better with slightly different hyperparameter settings on the larger training set, but there is no straightforward way to predict which hyperparameter settings would be better with the larger dataset.
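To make this concrete, here is a minimal sketch of that workflow using scikit-learn (the synthetic data, grid, and network sizes are illustrative, not from the question). `GridSearchCV` with its default `refit=True` does exactly what's described: it picks hyperparameters by 10-fold CV, then retrains one model with those settings on all of the data.

```python
# Sketch: tune hyperparameters with 10-fold CV, then refit the chosen
# settings on the full dataset. Data and parameter grid are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Stand-in for the ~1000 labeled subjects.
X, y = make_classification(n_samples=1000, random_state=0)

search = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(10,), (25,)], "alpha": [1e-4, 1e-2]},
    cv=10,
    scoring="accuracy",
)
# refit=True (the default): after CV selects the best settings,
# a single model is retrained on all 1000 points with them.
search.fit(X, y)

final_model = search.best_estimator_  # the model to use on new data
```

The hyperparameters were chosen on folds of size 900, but the deployed `final_model` is trained on all 1000 points with those same settings.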

DavidR
  • 1,627
  • 11
  • 15
1

Re-train the model, including the hyperparameter search, on the full dataset, and use that on new data. That model is the one for which your cross-validation performance estimates hold. For details see Model selection and cross-validation: The right way.

Note: there isn't a question of which HP set is best (e.g., say you got $\lambda=7$ in one of your folds and $\lambda=8$ in another and are worried that neither will be optimal for the model trained on the full dataset); you start from scratch and find a new $\lambda$ for the full dataset. Hence you are cross-validating a modeling *process* that involves both training parameters AND hyperparameters. You do NOT take one of the $\lambda \in \{7, 8\}$, fix it, and train your (non-hyper) parameters alone on the full dataset. That's not the process you've cross-validated.
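The process described above is often called nested cross-validation, and it can be sketched in a few lines of scikit-learn. This is a hedged illustration only: it uses `LogisticRegression`'s regularization strength `C` as a stand-in for $\lambda$, and synthetic data in place of the real dataset.

```python
# Sketch: cross-validate the whole modeling process (training parameters
# AND hyperparameter search), then rerun that process once on all data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# The "modeling process": an inner grid search over C (our lambda stand-in).
process = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)

# Outer CV estimates the performance of that entire process; the C picked
# inside each outer fold may differ from fold to fold, and that's fine.
outer_scores = cross_val_score(process, X, y, cv=10)

# Deployment model: run the same process once on the full dataset. The C
# it selects here is found from scratch, not copied from any single fold.
final = process.fit(X, y).best_estimator_
```

`outer_scores.mean()` is the performance estimate that applies to `final`, because both come from the same process: fit-with-inner-tuning.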

sjw
  • 5,091
  • 1
  • 21
  • 45