Questions tagged [cross-validation]

Repeatedly withholding subsets of the data during model fitting in order to quantify the model performance on the withheld data subsets.

Refers to general procedures that attempt to determine the generalizability of a statistical result. Cross-validation arises frequently in the context of assessing how a particular model fit predicts future observations and how to optimally select model parameters.

Methods for cross-validation usually involve withholding a random subset of the data during model fitting (the remaining data form the training set), quantifying how accurately the withheld data (the testing set) are predicted, and repeating this process to obtain a measure of prediction accuracy. When this partitioning happens only once, the procedure is called the holdout method.
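To make the holdout method concrete, here is a minimal sketch in Python, assuming scikit-learn and NumPy are available; the toy data, the logistic-regression model, and the 70/30 split are illustrative choices rather than part of the tag description.

```python
# Minimal holdout sketch: one random split into a training set and a testing set.
# Assumes scikit-learn and NumPy; the toy data and model choice are illustrative.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                          # toy feature matrix
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)   # toy binary response

# Withhold 30% of the data once; fit on the rest.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

# Error rate on the single withheld subset.
holdout_error = np.mean(model.predict(X_test) != y_test)
print(f"holdout error estimate: {holdout_error:.3f}")
```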

The holdout method has two basic drawbacks:

  1. When the dataset is small, we may not be able to afford setting aside a portion of it for testing.
  2. Since it is a single train-and-test experiment, the holdout estimate of the error rate can have high variance, depending on how the data happen to be split.

One approach to dealing with these limitations is k-fold cross-validation (a code sketch follows the steps below):

  1. Create k equally sized partitions (folds) of the data. In practice, k is often set to 10.
  2. For each of the k folds, train the model on the other k-1 folds and test it on the held-out fold.
  3. Each of the k experiments yields a prediction error; the average of these k errors is the cross-validation estimate of the error rate.
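A minimal sketch of these three steps, again assuming scikit-learn and NumPy; KFold is used only to produce the index partitions, and the toy data, model, and k = 10 are illustrative choices.

```python
# k-fold cross-validation sketch: partition, train on k-1 folds, test on the
# remaining fold, and average the k fold errors. Toy data and model are illustrative.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

k = 10
fold_errors = []
for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    # Misclassification rate on the held-out fold.
    fold_errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))

# The average over the k folds is the cross-validation estimate of the error rate.
print(f"{k}-fold CV error estimate: {np.mean(fold_errors):.3f}")
```

Here shuffle=True randomizes which observations land in each fold; without it, KFold partitions the data in its original order.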

The advantage of k-fold cross-validation is that every example in the dataset is eventually used for both training and testing, so the estimate depends less on how the data happen to be partitioned; the variance of the resulting estimate is reduced as k is increased. The disadvantage is that the training algorithm has to be rerun from scratch k times, so the evaluation takes k times as much computation.

When we set k = n (the number of observations), this is known as leave-one-out cross-validation, because each model is trained on n-1 observations and tested on the single remaining observation.
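For completeness, a sketch of the k = n case, assuming scikit-learn's LeaveOneOut splitter (equivalent to KFold with n_splits equal to the sample size); the small toy dataset simply keeps the n model fits cheap.

```python
# Leave-one-out CV sketch: n splits, each trained on n-1 observations and
# tested on the single remaining one. Toy data and model are illustrative.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + rng.normal(size=60) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=LeaveOneOut(), scoring="accuracy")
print(f"LOOCV error estimate: {1 - scores.mean():.3f}")
```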

3195 questions

265 votes, 13 answers
Is there any reason to prefer the AIC or BIC over the other?
The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a preference based on the stringency of the criteria,…
russellpierce

242 votes, 7 answers
How to choose a predictive model after k-fold cross-validation?
I am wondering how to choose a predictive model after doing K-fold cross-validation. This may be awkwardly phrased, so let me explain in more detail: whenever I run K-fold cross-validation, I use K subsets of the training data, and end up with K…
Berk U.

178 votes, 5 answers
Training on the full dataset after cross-validation?
TL;DR: Is it ever a good idea to train an ML model on all the data available before shipping it to production? Put another way, is it ever OK to train on all data available and not check if the model overfits, or get a final read of the expected…
Amelio Vazquez-Reina

173 votes, 4 answers
Choice of K in K-fold cross-validation
I've been using $K$-fold cross-validation a few times now to evaluate performance of some learning algorithms, but I've always been puzzled as to how I should choose the value of $K$. I've often seen and used a value of $K = 10$, but this seems…
Charles Menguy

131 votes, 4 answers
Nested cross validation for model selection
How can one use nested cross validation for model selection? From what I read online, nested CV works as follows: There is the inner CV loop, where we may conduct a grid search (e.g. running K-fold for every available model, e.g. combination of…
Amelio Vazquez-Reina

130 votes, 4 answers
Differences between cross validation and bootstrapping to estimate the prediction error
I would like your thoughts about the differences between cross validation and bootstrapping to estimate the prediction error. Does one work better for small datasets or for large datasets?
grant

122 votes, 8 answers
Bias and variance in leave-one-out vs K-fold cross validation
How do different cross-validation methods compare in terms of model variance and bias? My question is partly motivated by this thread: Optimal number of folds in $K$-fold cross-validation: is leave-one-out CV always the best choice? The answer…

111 votes, 5 answers
Using k-fold cross-validation for time-series model selection
Question: I want to be sure of something: is the use of k-fold cross-validation with time series straightforward, or does one need to pay special attention before using it? Background: I'm modeling a time series of 6 years (with semi-markov…
Mickaël S

106 votes, 10 answers
Validation error less than training error?
I found two questions here and here about this issue, but there is no obvious answer or explanation yet. I face the same problem where the validation error is less than the training error in my convolutional neural network. What does that mean?

101 votes, 3 answers
Feature selection and cross-validation
I have recently been reading a lot on this site (@Aniko, @Dikran Marsupial, @Erik) and elsewhere about the problem of overfitting occurring with cross-validation (Smialowski et al. 2010, Bioinformatics; Hastie, Elements of Statistical Learning). The…
BGreene

90 votes, 6 answers
Feature selection for "final" model when performing cross-validation in machine learning
I am getting a bit confused about feature selection and machine learning and I was wondering if you could help me out. I have a microarray dataset that is classified into two groups and has 1000s of features. My aim is to get a small number of…

89 votes, 5 answers
On the importance of the i.i.d. assumption in statistical learning
In statistical learning, implicitly or explicitly, one always assumes that the training set $\mathcal{D} = \{\mathbf{X}, \mathbf{y}\}$ is composed of $N$ input/response tuples $(\mathbf{X}_i, y_i)$ that are independently drawn from the same joint…
Quantuple

85 votes, 5 answers
Cross-validation in plain English?
How would you describe cross-validation to someone without a data analysis background?
Shane

75 votes, 1 answer
How to split the dataset for cross validation, learning curve, and final evaluation?
What is an appropriate strategy for splitting the dataset? I ask for feedback on the following approach (not on the individual parameters like test_size or n_iter, but if I used X, y, X_train, y_train, X_test, and y_test appropriately and if the…
tobip

74 votes, 5 answers
Understanding stratified cross-validation
I read in Wikipedia: In stratified k-fold cross-validation, the folds are selected so that the mean response value is approximately equal in all the folds. In the case of a dichotomous classification, this means that each fold contains roughly…
Amelio Vazquez-Reina