
I am going through the excellent book "Introduction to Machine Learning with Python" and reading about cross-validation. I can understand how it makes more efficient use of the data than a typical train-test split, but the book also contains the caveat:

It is important to keep in mind that cross-validation is not a way to build a model that can be applied to new data. Cross-validation does not return a model... multiple models are built internally, but the purpose of cross-validation is only to evaluate how well a given algorithm will generalize when trained on a specific dataset.

So if cross-validation doesn't produce a model, does that mean that after performing cross-validation, I need to build a model in the typical way, using a train/test split? If so, that would imply that my cross-validation scores would typically be higher than my final model's scores, since cross-validation makes more efficient use of the data.

Or is it held that after cross-validation, I can simply train my model against all data without any further test set? That would mean that I've never tested my actual model, which sounds wrong, but perhaps cross-validation is a valid test since it uses every sample in both training and testing? If so, that implies that my cross-validation scores would typically be lower than my final model's scores, since only my final model would be trained against all samples.
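For concreteness, here is roughly what I mean, sketched with scikit-learn (the book's library); the iris data and logistic regression are just placeholders for whatever data set and estimator one actually has:

```python
# Rough sketch of the two workflows I'm asking about (scikit-learn;
# the data set and estimator are only placeholders).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)

# Workflow A: use cross-validation to evaluate the algorithm, then still
# build the "final" model with a conventional train/test split.
cv_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model_a = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("CV mean:", cv_scores.mean(), "test score:", model_a.score(X_test, y_test))

# Workflow B: treat the cross-validation scores as the evaluation and
# train the final model on all of the data, with no further test set.
model_b = LogisticRegression(max_iter=1000).fit(X, y)
```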

Stephen

2 Answers


Or is it held that after cross-validation, I can simply train my model against all data without any further test set?

Yes - cross-validation is a (more efficient) replacement for that test set.

That would mean that I've never tested my actual model, which sounds wrong, but perhaps cross-validation is a valid test since it uses every sample in both training and testing?

CV treats its training sets as a good approximation of the whole data set (as do other types of resampling validation, such as out-of-bootstrap), so it is approximately right.

There are numerous studies on the error you make with different validation schemes, considering the total error = systematic + random error (bias + variance). It turns out that for small sample sizes (less than a few thousand independent cases), cross-validation is better than the alternative of a train-test split, where - as you say - you have the advantage of being unbiased but pay for it with much higher variance.

If so, that implies that my cross-validation scores would typically be lower than my final model's scores, since only my final model would be trained against all samples.

Yes - cross-validation will have a slight pessimistic bias if done correctly (depending on the slope of the learning curve between the CV training sample size and the total sample size). You trade that for less variance (depending on the test sample size for the train-test split and the total sample size for CV).
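A minimal sketch of that workflow with scikit-learn (the data set and estimator are only placeholders): the cross-validation scores serve as the performance estimate, and the model you actually use is then trained on the whole data set.

```python
# Minimal sketch (scikit-learn; data set and estimator are placeholders):
# cross-validation supplies the performance estimate, and the final model
# is then trained on all of the data.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Estimate how well this algorithm generalizes when trained on a data set
# (almost) as large as the one at hand.
cv_scores = cross_val_score(SVC(), X, y, cv=10)
print("estimated accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

# The final model: the same algorithm trained on all samples. Because of the
# slightly larger training set, it is expected to do at least as well as the
# (slightly pessimistic) CV estimate suggests.
final_model = SVC().fit(X, y)
```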

cbeleites unhappy with SX
  • Upvoted for the thoughtful answer. I am still wrapping my head around these concepts and going through the book. My goal is to accept an answer once I understand everything better, since the two answers are somewhat different in emphasis. – Stephen May 14 '17 at 18:52
  • One question I do have about your answer: How does the train-test split have the advantage of being unbiased? I get that CV has the slight bias of having a smaller training size than the total sample size, but the train-test split would have this too. – Stephen May 14 '17 at 19:04
  • @Stephen, the train-test split claims that *train* is the training set, and there is no model built on all data. The CV estimate is extrapolated: it is assumed to be a good estimate of the generalization performance of a model trained on all data. (BTW, a similar extrapolation exists for train-test splits, which I've met under the name of set validation - and yes, I think there's a whole lot of confusion and differing vocabulary around these topics) – cbeleites unhappy with SX May 17 '17 at 08:32

The phrase from the book doesn't sound right to me, but maybe I don't have the whole context.

Let's assume that we split your whole data set into two: train and test (for example 70%, 30%).

If we apply a cross-validation scheme to the training set (not touching the test set) by splitting it into several sections (folds), and run the machine learning algorithm with the intention of finding the hyperparameter values that produce the lowest error within that cross-validation, we of course end up with a model (meaning that we also find the values of the model's parameters).

If we think for a moment about why we are doing cross-validation, we see that the reason is to reduce the number of freely adjusting parameters. That's why the number of hyperparameters is almost always lower than the number of model parameters. We do this because the number of free parameters increases the (very optimistic) bias when the expected error is estimated from the data used for training.

Having said that, cross-validation still has freely varying hyperparameters, so the bias is reduced but not completely eliminated. For this reason, when you actually run the model produced by the cross-validation on the test set (the 30% part), its performance will still be worse on average than during training.

That's why it is a good idea to still keep a test set aside even if you are building your model with cross-validation. The error of the model on that test set is the most honest estimate of the expected error.
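A rough sketch of that workflow with scikit-learn (the data set, estimator, and hyperparameter grid are only placeholders): cross-validation on the 70% training part chooses the hyperparameters, and the untouched 30% test part gives the final error estimate.

```python
# Rough sketch (scikit-learn; data, estimator and grid are placeholders):
# cross-validation on the training set tunes hyperparameters, the held-out
# test set gives the honest final estimate.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# 70% / 30% split; the test set is not touched during cross-validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Grid search with 5-fold cross-validation on the training set only.
param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

# The CV score is slightly optimistic because the hyperparameters were tuned
# against it; the test-set score is the more honest estimate.
print("best CV score:", search.best_score_)
print("test set score:", search.score(X_test, y_test))
```

With the default refit=True, GridSearchCV refits the best parameter combination on the whole training set, so search.score evaluates the model produced by the cross-validation on the held-out test data.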

Cagdas Ozgenc
  • Upvoted for the thoughtful answer. I am still wrapping my head around these concepts and going through the book. My goal is to accept an answer once I understand everything better, since the two answers are somewhat different in emphasis. – Stephen May 14 '17 at 18:52
  • I do have one question about your answer: You say that cross-validation (or perhaps "grid search with cross validation") is done to reduce the number of freely adjusting parameters. My understanding so far is that this is done to *choose* such parameters. This process will produce bias because of the likely overfitting of these hyperparameters. That part I understand, but is there something more you're referring to here other than the evaluation of many parameters to choose the best one? Perhaps the decision not to use some parameters at all? – Stephen May 14 '17 at 19:12
  • Every time you adjust and test a parameter on the same data set, you introduce a bias. Model parameters are set and evaluated on the training set, hence they create a bias. Hyperparameters are set and evaluated on the validation set, hence they also create a bias, but a smaller one. I don't know what you mean by not using parameters. Without parameters there's nothing to train. – Cagdas Ozgenc May 14 '17 at 19:37
  • Sorry, let me rephrase my question to be clear: I don't understand what you mean by "reduce the number of freely adjusting parameters." I don't see why cross-validation "reduces" parameters; I thought it just "chooses" them. – Stephen May 16 '17 at 01:57
  • It doesn't reduce them in the sense of removing them, but hyperparameters limit the free movement of the model's parameters during optimization. For example, with a ridge penalty the total of the squared weights is constrained by the lambda hyperparameter. – Cagdas Ozgenc May 16 '17 at 06:14
  • Oh ok thanks, I didn't realize you were referring to the impact of hyperparameters on regular parameters. I just didn't get your terminology but it makes sense now. – Stephen May 17 '17 at 00:04
  • There's IMHO a lot of confusion coming from the particular use of vocabulary here (= in this field). Cross-validation in itself is *independent* of whether its results are used for determining hyperparameters or for, well, validation (or more precisely, verification) of generalization error. So in that context, the book's sentence is exactly right. But a lot of confusion comes from using a technique with "validation" in its name for other (search/optimization) purposes. The point here is that model optimization relies on optimizing some figure of merit = some measure of performance. This may... – cbeleites unhappy with SX May 17 '17 at 10:50
  • ... be obtained in a number of ways, including techniques that validate/verify the performance of a given model. However, using validation techniques internally during part of the *training* process (optimization is still part of training!) does not get rid of the need to do a proper *validation* (verification) of the final model. And there again you have the choice among a number of different techniques, including cross-validation. – cbeleites unhappy with SX May 17 '17 at 10:54
  • Also, in using CV for hyperparameter optimization, the CV itself does *not* return *the* model. Within the optimization, CV typically just returns some figure of merit for every model (parameter/hyperparameter set) that is fed into it. It is a separate question how to decide, with the help of these figures of merit, which (hyper)parameters to use for the optimized model. And even if the default is "take the one that looked best", that is neither the only possible nor the only sensible option. – cbeleites unhappy with SX May 17 '17 at 11:00