
When you are trying to fit models to a large dataset, the common advice is to partition the data into three parts: the training, validation, and test datasets.

This is because the models usually have three "levels" of parameters: the first "parameter" is the model class (e.g. SVM, neural network, random forest); the second set of parameters are the "regularization" parameters or "hyperparameters" (e.g. lasso penalty coefficient, choice of kernel, neural network structure); and the third set are what are usually considered the "parameters" (e.g. coefficients for the covariates).

Given a model class and a choice of hyperparameters, one selects the parameters that minimize error on the training set. Given a model class, one tunes the hyperparameters by minimizing error on the validation set. One selects the model class based on its performance on the test set.
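
For concreteness, here is a minimal sketch of that three-level scheme in Python, assuming scikit-learn; the ridge model, the alpha grid, and the synthetic data are only illustrative choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# Partition into training, validation, and test sets (60/20/20 here).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Level 3 (parameters): .fit() picks coefficients that minimize training error.
# Level 2 (hyperparameters): pick the penalty alpha that minimizes validation error.
best_alpha, best_val_mse = None, np.inf
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    val_mse = mean_squared_error(y_val, model.predict(X_val))
    if val_mse < best_val_mse:
        best_alpha, best_val_mse = alpha, val_mse

# Level 1 (model class): the test set is held back for comparing model classes
# (or, as discussed in the answers below, only for the final performance estimate).
final_model = Ridge(alpha=best_alpha).fit(X_train, y_train)
print(best_alpha, mean_squared_error(y_test, final_model.predict(X_test)))
```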

But why not more partitions? Often one could split the hyperparameters into two groups, and use a "validation 1" set to fit the first and a "validation 2" set to fit the second. Or one could even treat the size of the training/validation split as a hyperparameter to be tuned.

Is this already a common practice in some applications? Is there any theoretical work on the optimal partitioning of data?

Ferdi
charles.y.zheng

2 Answers


First, I think you're mistaken about what the three partitions do. You don't make any choices based on the test data. Your algorithms adjust their parameters based on the training data. You then run them on the validation data to compare your algorithms (and their trained parameters) and decide on a winner. You then run the winner on your test data to give you a forecast of how well it will do in the real world.

You don't validate on the training data because that would overfit your models. You don't stop at the winner's validation score because you've been iteratively adjusting things to get a winner in the validation step, and so you need an independent test (that you haven't specifically been adjusting towards) to give you an idea of how well you'll do outside of the current arena.

Second, I would think that one limiting factor here is how much data you have. Most of the time, we don't even want to split the data into fixed partitions at all, hence cross-validation (CV).
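
A minimal sketch of the workflow described above, assuming scikit-learn (the two candidate models and the synthetic data are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# Three fixed partitions: train (60%), validation (20%), test (20%).
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# Each algorithm adjusts its parameters on the training data only.
fitted = {name: model.fit(X_train, y_train) for name, model in candidates.items()}

# The validation data are used to compare the fitted models and decide on a winner.
val_mse = {name: mean_squared_error(y_val, model.predict(X_val)) for name, model in fitted.items()}
winner = min(val_mse, key=val_mse.get)

# The test data are touched once, to forecast the winner's real-world performance.
print(winner, mean_squared_error(y_test, fitted[winner].predict(X_test)))
```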

Wayne
  • The conceptual issue I had is that if you are comparing enough models, you are effectively fitting on the validation data when you "decide on a winner" using the validation data. Hence there may still be a point in partitioning the validation data. – charles.y.zheng Apr 12 '11 at 14:08
  • I think that the training-validation layer and the validation-testing layer serve different purposes in some sense, and that you do eventually have to compare models on a common validation set if you're going to declare a winner. So I'm not sure that additional layers help. (Though my knowledge isn't deep enough to really know.) The closest thing I can think of to your suggestion is how the Netflix competition was run. I believe they used partial test sets to keep teams from climbing the test set gradient, but I think it's different. – Wayne Apr 12 '11 at 21:36
  • No need to split the data into 3 parts, you only need 2. The 1st data set is used to fit your models. The 2nd data set is used to validate those fits on an independent data set. The "how well it will do in the real world" question is already answered in the 2nd data set... it wasn't used for fitting. Of course, if you want even better estimates then you'll want to randomly split your data into 2 sets many times, each time fitting your models to the training part and predicting on the validation part. –  Apr 25 '12 at 21:18
  • I think the key question is: what exactly do you mean by "fit your models"? As the original poster pointed out, there are things which might be called coefficients which are fit on a training set as a direct result of processing the data, but there are also parameters (a threshold, for example) that can be set independently of the data itself. If you were willing to enter multiple copies of each model with different choices for parameters, you could do it in two steps. But if you intend to tweak these parameters, that seems to require an additional step. – Wayne Apr 25 '12 at 21:52
  • I don't understand. If you compare the models between the validation and training and then choose the best one, you are basically only training on one dataset and, hence, overfitting. – JobHunter69 Jun 29 '16 at 23:37
  • @user10882 You are completely forgetting hyperparameter tuning, that's the reason for the three part split. – Firebug Aug 10 '16 at 17:26
  • @user10882, your comment is not correct, and neither is Firebug's. Both the (1) model parameters (weights, thresholds) and (2) so-called "hyper" parameters (number of hidden layers, number of decision trees) may have vastly different interpretation and feel, but are all *just parameters distinguishing between different models*. Use the training data to optimise them all, use the validation data to avoid over-fitting, and use cross validation to make sure your results are stable. The test data only serve to specify the expected performance of your model; do not use it to accept/reject it. – Ytsen de Boer Jan 16 '17 at 10:53
  • @YtsendeBoer I agree with you. I encountered a real [problem](http://datascience.stackexchange.com/questions/17755/will-cross-validation-performance-be-an-accurate-indication-for-predicting-the-t), where I only split my data set into train and validation (no test set). So I didn't know the performance of the winner of the validation step, and when I did get a real test set, I found the winner of the validation step was actually very bad. My question, then, is: how should I pick, from among all my candidate models, the winner that will have the best performance on an independent test set? – KevinKim Mar 26 '17 at 15:30
  • @YtsendeBoer: I disagree. Hyperparameters are different from "regular" parameters in that they cannot be optimized given the training data, because they typically reflect the complexity of the model, and more complex models always have lower within-sample error. Say, for instance, that I want to fit an $n$-th degree polynomial to some data. If I allow my learning algorithm to pick $n$ based on the error in the training data, the "solution" for $n$ will always equal the number of data -1, because that gives a loss of 0. Thus, for hyperparameters, we optimize the validation error instead. – Ruben van Bergen Apr 12 '17 at 12:01
  • @RubenvanBergen: you describe over-fitting. It is exactly for that reason that you have a validation step: to avoid selection of such models. It has nothing to do with the parameter being a "normal" or "hyper" parameter. – Ytsen de Boer Apr 13 '17 at 07:06
  • @YtsendeBoer: I think you misunderstand me so let me clarify. The way you would avoid overfitting (in this example) is to pick a certain range of $n$ (say 1:10), and fit the polynomial coefficients (by minimizing the training error) given each of those settings of $n$. You then compare the validation errors of those 10 fitted models to select the best value of $n$. What you don't do is fit $n$ to the training data by selecting the $n$ that minimizes the training error. So you use a different procedure to select the parameters (polynomial coefficients) vs. to select the hyperparameter $n$ (see the sketch after these comments). – Ruben van Bergen Apr 13 '17 at 07:20
  • @RubenvanBergen: I understand what you say and it is good and useful to point that out to user10882. But I still argue that it is ultimately a technicality. Say you use a gradient descent algorithm that uses the training data to infer the step direction (including the polynomial degree $n$) together with a validation procedure that adds the validation loss to the training loss in each step of the gradient descent algorithm (similar to early stopping). Now the difference between "normal" or "hyper" is not relevant any more: it depends on the procedure. – Ytsen de Boer Apr 13 '17 at 08:33
  • @YtsendeBoer: Fair enough - if you use something like validation-based early stopping then I agree the boundaries get blurred, at least in terms of the optimization procedure. To my mind this doesn't fully merge the concept of a "hyperparameter" with that of a regular one, though. There are still many situations where they are treated differently, and I also think about them differently in terms of their roles in defining a model. Anyway, I hope this discussion has been useful to others to illustrate the (subtle) differences & similarities between these concepts =). – Ruben van Bergen Apr 13 '17 at 11:04
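
To make the polynomial-degree example from the comments concrete, here is a minimal sketch in Python/numpy; the synthetic data and the candidate degrees 1-10 are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)


def truth(x):
    return 1.0 - 2.0 * x + 0.5 * x**3


x_train = rng.uniform(-1, 1, size=40)
x_val = rng.uniform(-1, 1, size=20)
y_train = truth(x_train) + rng.normal(scale=0.1, size=40)
y_val = truth(x_val) + rng.normal(scale=0.1, size=20)

# Parameters (polynomial coefficients) are fit by minimizing training error for
# each fixed degree n; the hyperparameter n is then chosen by validation error.
val_mse = {}
for n in range(1, 11):
    coeffs = np.polyfit(x_train, y_train, deg=n)
    val_mse[n] = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

best_n = min(val_mse, key=val_mse.get)
print(best_n, val_mse[best_n])

# Choosing n by training error instead would always push n as high as allowed,
# which is exactly the over-fitting the validation set is meant to prevent.
```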

This is an interesting question, and I found @Wayne's answer helpful.

From my understanding, how to divide the dataset into partitions depends on the purpose of the modeller and on the requirements of the model in its real-world application.

Normally we have two datasets: training and testing. The training set is used to find the parameters of the models, i.e. to fit the models. The testing set is used to evaluate the performance of the model on unseen (or real-world) data.

If we do just one round of training, it is obvious that there is a training process and a testing (or validating) process.

However, doing it this way may raise an over-fitting problem, because the model is trained on a single dataset, one time. This may lead to instability of the model on real-world problems. One way to address this issue is to cross-validate (CV) the model on the training dataset. That means we divide the training dataset into different folds, keeping one fold for testing the model that is trained on the other folds. The winner is then the model that gives the minimum loss (based on our own objective function) over the whole CV process. Doing it this way, we can minimize the chance of over-fitting during training and select the right winner. The test set is again used to evaluate the winner on unseen data.
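
A minimal sketch of that procedure, assuming scikit-learn (the two candidate models and the synthetic data are illustrative): 5-fold CV within the training set picks the winner, and the test set is used once at the end.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=500)

# One held-out test set; everything else is the training set used for CV.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

candidates = {
    "ridge": Ridge(alpha=1.0),
    "forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

# 5-fold CV within the training set: each model is repeatedly fit on 4 folds
# and scored on the held-out fold; the winner has the lowest mean CV loss.
cv_mse = {
    name: -cross_val_score(model, X_train, y_train, cv=5,
                           scoring="neg_mean_squared_error").mean()
    for name, model in candidates.items()
}
winner = min(cv_mse, key=cv_mse.get)

# Refit the winner on the full training set; the test set then gives the final
# estimate of its performance on unseen data.
final = candidates[winner].fit(X_train, y_train)
print(winner, mean_squared_error(y_test, final.predict(X_test)))
```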