3

I have a binary dataset and I want to build a classifier. I understand that to monitor performance I need to split the data into training and test sets and report accuracy, or any other metric that interests me, on the test set. Now I have a new example whose label I do not know, and I want to predict it using my classifier.

My question is: should I use the classifier built on the training set only, or merge the training and test sets, build a classifier on all available data, and then predict my new example while still reporting the accuracy based only on the test set?
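
To make the two options concrete, here is a minimal sketch, assuming scikit-learn; the synthetic data and logistic-regression estimator are only stand-ins for my actual setup:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the real data; the last row plays the role of
# the new, unlabelled example.
X_all, y_all = make_classification(n_samples=500, random_state=0)
X, x_new = X_all[:-1], X_all[-1:]
y = y_all[:-1]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit on the training set and report accuracy on the held-out test set.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, clf.predict(X_test))

# Option A: predict the new example with the training-set model.
pred_a = clf.predict(x_new)

# Option B: refit the same model specification on all labelled data and
# predict with that, while still quoting test_accuracy from above.
clf_full = LogisticRegression(max_iter=1000).fit(X, y)
pred_b = clf_full.predict(x_new)
```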

Karolis Koncevičius
  • 4,282
  • 7
  • 30
  • 47
nicnaz
  • 77
  • 1
  • 5
  • 1
    I do not think it is exactly a duplicate of the question you are referring to... I was interested more in the end-user application. I found this post answering my question https://machinelearningmastery.com/train-final-machine-learning-model/ – nicnaz Jul 29 '18 at 11:47
  • This is not a duplicate of the given cross-validation link, since this question is not about validation, but training and testing set. Whether you use a share of the training set as an estimator of the testing set is up to you, but not relevant for the question. See a very similar question on Stack Overflow: [Does the training+testing set have to be different from the predicting set (so that you need to apply a time-shift to ALL columns)?](https://stackoverflow.com/questions/59210109/does-the-trainingtesting-set-have-to-be-different-from-the-predicting-set-so-t). – questionto42standswithUkraine Jun 11 '21 at 15:08
  • I have raised a new flag to remove the duplicate. "This question is not a duplicate of the flagged duplicate. The question was misunderstood by every answer. See my comment under the question. It is perfectly possible that putting together training and testing set makes your prediction more up to date and thus better, EVEN if you do not know the accuracy at that point in time. You will know it in the future, though, and it is clear that not using a testing set loses the testing set's data. Which will decrease the prediction quality." *(that holds if you need a model to be up to date)* – questionto42standswithUkraine Jun 11 '21 at 15:27
  • 1
    @question The duplicate asks "Is it ever a good idea to train an ML model on all the data available before shipping it to production?" The present post asks *two* questions, of which the duplicate is one (and the other concerns reporting the accuracy). It was therefore correct to close this either as a duplicate or as being insufficiently focused. It's a good example of why we require questions to be focused. When your interpretation differs from those of *everyone* in a thread, downvoting reflects your reading but is not constructive. – whuber Jun 11 '21 at 15:48
  • @whuber Or you have not understood it either. I had the same question, and the other full thread does help finding the answer, and none of the answers here do so, since none of them think of the now unknown but in future known labels to ***make*** the future testing set. It is nothing but a model evaluation shifted in time. Which is perfectly possible in a business context, and which has nothing to do as to how you use a validation during training time or not. This question asks about training+testing set, not about validation set, and it is thus not vague. – questionto42standswithUkraine Jun 11 '21 at 15:55
  • 1
    @question Nobody is asserting it's vague. When you think *everybody* has misunderstood a question, it's time to look inward, rather than spray downvotes and accusatory comments around. Since you have a related question and are coming from the laudatory position of so clearly seeing how it is distinct from the duplicate, then you should have no trouble posing your question with sufficient clarity that it won't be closed as just another duplicate. – whuber Jun 11 '21 at 16:39
  • @whuber "And it is thus not vague" referred to your "It's a good example of why we require questions to be focused.". Look at the duplicate's question header of [Training on the full dataset after cross-validation?](https://stats.stackexchange.com/questions/11602/training-on-the-full-dataset-after-cross-validation) and the answers referring all to cross-validation, mentioning even that this would only estimate the accuracy. This question here is not about cross-validation, it is only about the testing set - a different focus, even when the starting questions there seem to be the same as here. – questionto42standswithUkraine Jun 11 '21 at 19:20

3 Answers

3

So in your case, what you refer to as the test set is really a validation set, and your new example would serve as the test set.

Normally, you would find the best model that minimizes the validation error and use that to predict on the new example. Whether or not you retrain on the whole dataset (train + val) is up to you and can depend on the amount of data you have, and perhaps also on how long training takes. But in general more data is better, so I would retrain on train + val with the hyperparameters found by minimizing the validation error.
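
A minimal sketch of that workflow, assuming scikit-learn; the synthetic data, the SVM, and the candidate hyperparameter values are illustrative, not from the original post:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, random_state=0)

# Split off a test set, then split the remainder into train and validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

# Choose the hyperparameter that minimizes validation error.
best_C, best_acc = None, -1.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = SVC(C=C).fit(X_train, y_train)
    acc = accuracy_score(y_val, model.predict(X_val))
    if acc > best_acc:
        best_C, best_acc = C, acc

# Refit with the chosen hyperparameter on train + validation,
# then report accuracy on the untouched test set.
final_model = SVC(C=best_C).fit(X_trainval, y_trainval)
print(accuracy_score(y_test, final_model.predict(X_test)))
```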

Tom
  • 1,204
  • 8
  • 17
  • 2
    I do not think that I can refer to my new example as a test set as I do not know the label to check that it is correct and it is only one example... – nicnaz Jul 19 '18 at 22:10
  • Then sure, you cannot use it as a test set properly speaking. See the top answer from Karolis's link. In general, it is a good idea to retrain on the whole train set after having chosen the hyperparameters with a val set. – Tom Jul 20 '18 at 01:50
  • Downvote. See the comment of @nicnaz, who rightly says that there are no labels available in the "predicting set". And your next comment even agrees with that; thus, the answer does not add value. The validation set, as a share of the training set, does not play a role in this question. The question is explicitly about the testing set as another part of the dataset, which shall be reused for training right before the real prediction is done. That does not depend on the validation set, and the parameters could also be tuned with or without a validation set, just by trial and error. – questionto42standswithUkraine Jun 11 '21 at 14:53
1

You should use only the model built on the training data set. Unless you realise that your training and test data are very similar and there will be a negligible difference when building a model that includes both, do not add the test set to the training data to build a new model. If you end up with a slightly different model, you would not be able to test how it would fare on out-of-sample data (cross-validation might be an alternative).

When you go into model maintenance, you could start with newer test and training sets. By then, it is expected that you will have more data, meaning more samples to build a more robust model.
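
A minimal sketch of the cross-validation alternative mentioned above, assuming scikit-learn; the synthetic data and the estimator are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)

# 5-fold cross-validation gives an out-of-sample accuracy estimate
# without permanently sacrificing a separate test set.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean(), scores.std())
```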

Srikrishna
  • 40
  • 3
  • 1
    Train and test set in my case come from the same data source. I just want to build a model that an external partner could use as a black box to predict the class of any new example that he might get... So what do you think is the best approach for this case? – nicnaz Jul 19 '18 at 22:13
  • Can you explain why the test set shouldn't be used for fitting after all tuning / error estimation is complete? – dsaxton Jul 20 '18 at 00:11
  • 1
    Because I would like to use my test set to calibrate the model before I start predicting the classes for out of sample data. If I use my test data as well for training the model, this would very well change the model result and also, I have no sample of my data to calibrate this new model against (unless cross validation measures are sufficient for this use case) – Srikrishna Jul 20 '18 at 04:58
  • Downvote. You *can* use the full dataset as the final training set, using the hyperparameters from previous training. The testing set MUST represent the unbiased training set, else it does not make any sense. The predicted future will become your testing set, since after some time you can compare the predicted classes with the real labels you will know later. Make a check: predict both on the model of the "training set" and on the model of the "training + testing set". The latter should be more up to date and thus better, since no precious recent data was lost to a testing set. – questionto42standswithUkraine Jun 11 '21 at 15:23
1

This is a fantastic question. Both Tom and Srikrishna make good points. What it comes down to is choosing hyperparameters. Having selected hyperparameters using a training and validation set, and assessed accuracy using a testing set, it will not be possible to know whether those same hyperparameters will work well if you use all your data for training and validation.

I suggest you do a bit of research to determine whether data generated from a distribution sufficiently similar to that of your training/validation data will still perform well.
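
One hedged way to probe this concern, assuming scikit-learn with synthetic data: a learning curve shows whether validation performance of the chosen (here, illustrative) hyperparameters has levelled off as the training set grows, which hints at how risky it is to refit on more data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=800, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    SVC(C=1.0),                      # hyperparameters selected earlier (illustrative)
    X, y, cv=5,
    train_sizes=np.linspace(0.2, 1.0, 5),
)
print(sizes)
print(val_scores.mean(axis=1))       # if this is flat at the end, refitting on more data is low-risk
```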

Chris
  • 681
  • 4
  • 13
  • Downvote. "it will not be possible to know if those same parameters will work well if you use all your data for training and validation." The question is not about validation. If you have once found out the best hyperparameters, you can well use them on the training + testing set as one full training set again. Without testing it. That should make your model more up to date. This is especially important for data that is split in variables over time, and you want to have the most recent data included in the training. I would advise using the training+testing set as final training set. – questionto42standswithUkraine Jun 11 '21 at 15:04
  • New data = new model. New model = new hyperparameter search and empirical loss estimate. You have no way of knowing how training on the validation data will affect your model, and thus the need for different hyperparameters to balance bias and variance. – Chris Jun 17 '21 at 21:18
  • This is not about the validation set, but about the choice of the testing set. Think of a database that has monthly feature columns spread over the last 12 months. Take the *first 9 months* to train the model, test it with the last 3 months (use a share in the training set for validation, but that is not relevant for the question). To be more up-to-date, you train a new model on the *last 9 months* without testing it for now. It will predict better than the "older" model if being up-to-date is important (normal business case). Test the model as soon as the future 3 months are in your database. – questionto42standswithUkraine Jun 18 '21 at 11:37