
I see these concepts quite often and want to see if I have the right intuitive understanding.

Model fitting is when I have a set of data and fit a model (e.g. linear regression) as 'close' to the data as possible according to some loss function (e.g. squared loss). This could result in overfitting, since a higher-order polynomial model will always have a lower SSE than a lower-order one.

Cross validation tests the predictive ability of different models by splitting the data into training and testing sets, and this helps check for overfitting.

For instance, if I fit a second-order polynomial to linear data, I will get a lower SSE but probably not a lower prediction error. Therefore, between the two, I should choose the linear model. A different example: if I am fitting a k nearest neighbors model, then for each value of k (up to a reasonable number), fit the model as close to the training data as possible. Then, compare the prediction error on the testing data between the different values of k, and pick the one that has the lowest prediction error. For this value of k, fit the model on the entire dataset.
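
In code, the kNN part of my reasoning would look roughly like this (just a sketch using scikit-learn; the made-up data, the 5-fold split, and the range of k are placeholders, not part of my actual problem):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Made-up data standing in for my actual dataset.
X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Estimate prediction error for each candidate k via 5-fold cross-validation.
candidate_ks = list(range(1, 21))
cv_mse = []
for k in candidate_ks:
    scores = cross_val_score(
        KNeighborsRegressor(n_neighbors=k), X, y,
        cv=5, scoring="neg_mean_squared_error",
    )
    cv_mse.append(-scores.mean())

# Pick the k with the lowest estimated prediction error ...
best_k = candidate_ks[int(np.argmin(cv_mse))]

# ... and fit the model with that k on the entire dataset.
final_model = KNeighborsRegressor(n_neighbors=best_k).fit(X, y)
```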

Do I have the right idea?

mai

2 Answers


You're almost there.

  • Model fitting indeed just means finding the weights/coefficients that minimize the loss function;
  • Cross-validating is repeated model fitting. Each fit is done on a (major) portion of the data and is tested on the portion of the data that was left out during fitting. This is repeated until every observation is used for testing.
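
A minimal sketch of that repetition, assuming scikit-learn and a made-up dataset and estimator (none of which are implied by the points above):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

X, y = make_regression(n_samples=100, n_features=2, noise=5.0, random_state=0)

fold_mse = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Fit on the major portion of the data ...
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # ... and test on the portion that was left out during fitting.
    fold_mse.append(np.mean((y[test_idx] - model.predict(X[test_idx])) ** 2))

# After the 5 folds, every observation has been used for testing exactly once.
cv_estimate = np.mean(fold_mse)
```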

However, retraining on all data after cross-validating does come at the cost of losing your 'independent' measure of performance. For example, when using regularization, the ideal amount suggested by your CV error might not correspond to the ideal amount for a model fitted to a larger number of observations (i.e., your whole data set). You are not guaranteed that this 'new' model outperforms the cross-validated one, and you have just lost your ability to quantify its variance and bias.

Instead, you could consider bootstrapping the variance and bias of your model, or using leave-one-out cross-validation so that a larger portion of the total data is used for fitting.
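
For instance, one way to bootstrap the spread of your model's performance could look roughly like this (only a sketch; the data, the estimator, and the out-of-bag scoring are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X, y = make_regression(n_samples=150, n_features=3, noise=8.0, random_state=0)
n = len(y)

oob_mse = []
for _ in range(500):
    boot = rng.integers(0, n, size=n)          # resample indices with replacement
    oob = np.setdiff1d(np.arange(n), boot)     # observations not in the resample
    model = LinearRegression().fit(X[boot], y[boot])
    oob_mse.append(np.mean((y[oob] - model.predict(X[oob])) ** 2))

# The spread of the out-of-bag errors gives an impression of the variance of the
# model's performance; their mean is a (somewhat pessimistic) point estimate.
print(np.mean(oob_mse), np.std(oob_mse))
```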

Frans Rodenburg
  • "retraining on all data after cross-validating is not standard practice" can you substantiate this? – cbeleites unhappy with SX May 10 '19 at 11:39
  • I tried to in the following paragraphs, did you find something unclear? – Frans Rodenburg May 10 '19 at 11:46
  • Sorry, it's my question that was unclear: in my field (chemometrics) retraining on the whole data set is the standard procedure, and the CV approximation of generalization error (when CV is used for verification purposes, not model selection) is well established. So I'm asking for references or a study showing that not fitting on the whole data set is standard practice. – cbeleites unhappy with SX May 10 '19 at 12:01
  • Also, I'm not entirely sure from your description what your suggested final model is. After bootstrapping, that may be an aggregated model, and then the CV estimate as described is not necessarily a good estimate of its predictive ability. More precisely: the un-aggregated CV estimate is off in exactly the situations where aggregation helps. – cbeleites unhappy with SX May 10 '19 at 12:02
  • While I cannot comment on what is standard (or good) practice in your field of expertise, without an independent (albeit still internal) validation you cannot claim a certain accuracy. See for example the discussions [here](https://stats.stackexchange.com/a/330471/176202) and [here](https://stats.stackexchange.com/q/184095/176202). My field is biostatistics / AI, if it matters. – Frans Rodenburg May 10 '19 at 12:23
  • My field puts strong emphasis on independent test data for generalization error estimation (and external validation on top of [lab] internal verification procedures). But: for us, cross *validation* (though IMHO cross verification would be a better term) is in the first place a technique for generalization error estimation. So it gives exactly what you ask for. If you then decide to add another step in training and use this estimate to guide *training* decisions, it is of course a training error estimate (with the consequence that for getting a generalization error estimate of the completely ... – cbeleites unhappy with SX May 10 '19 at 12:55
  • ... trained model, i.e. after deciding hyperparameters, you need to get independent test data - whether from another, outer CV, a held-out test set, or a validation study is irrelevant in this context). But back to cross validation: at the end of the cross validation procedure, you don't have *a* model, you have a number of models (that differ by having slightly different training sets). So my question about your answer is: how do *you* get from this bunch of surrogate models to *the* model? – cbeleites unhappy with SX May 10 '19 at 12:55
  • *The* model, for me, is the one function that I call with new data as input to get one prediction per presented case. (This applies regardless of the use you put the CV estimate to, whether a training-internal use or verification.) – cbeleites unhappy with SX May 10 '19 at 12:55
  • Why would you need to be restricted to a single model to get predictions? It is not hard to cast a majority vote or to average models. These concepts are well described in the literature (e.g. Elements of Statistical Learning). – Frans Rodenburg May 10 '19 at 13:17
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/93473/discussion-between-cbeleites-and-frans-rodenburg). – cbeleites unhappy with SX May 10 '19 at 13:24

Yes, your understanding is correct.

> Cross validation tests the predictive ability of different models by splitting the data into training and testing sets,

Yes.

> and this helps check for overfitting.

Model selection or hyperparameter tuning is one purpose for which the CV estimate of predictive performance can be used. It is IMHO important not to confuse CV itself with the purpose its results are put to.

In the first place, cross validation yields an approximation of the generalization error (the expected predictive performance of a model on unseen data).

This estimate can be used either

  • as an approximation of the generalization error of the model fitted on the whole data set with the same (hyper)parameter determination as was used for the CV surrogate models,
  • or to select hyperparameters. If you do this, the CV estimate becomes part of model training, and you need an independent estimate of generalization error, see e.g. nested aka double cross validation (a sketch follows below).
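
A rough sketch of what nested aka double cross validation can look like, assuming scikit-learn; the estimator, parameter grid, and data here are purely illustrative:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)

# Inner CV: hyperparameter selection, i.e. part of model training.
inner = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 21))},
    cv=5,
    scoring="neg_mean_squared_error",
)

# Outer CV: independent estimate of generalization error for the whole
# training procedure, including the hyperparameter search.
outer_scores = cross_val_score(inner, X, y, cv=5, scoring="neg_mean_squared_error")
print(-outer_scores.mean())
```

The point is that the hyperparameter search lives entirely inside each outer training fold, so the outer estimate stays independent of the selection.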

As for overfitting within the model training procedure, CV helps but cannot work miracles. Keep in mind that cross validation results are themselves subject to variance (from various sources). Thus, with an increasing number of models/hyperparameter settings in the comparison, there is also an increased risk of accidentally (due to the variance of the CV estimates) observing very good prediction and being misled by this (see the one-standard-error rule for a heuristic against this).

> For this value of k, fit the model on the entire dataset.

The many so-called surrogate models built and tested during cross validation are usually treated as a good approximation to applying the same training function to the entire data set. This allows the generalization error results obtained for the surrogate models to be used as an approximation of the generalization error of the "final" model.
This applies regardless of the use you put this generalization error estimate to later on.

cbeleites unhappy with SX