
In his Coursera video lecture on basic good practices in machine learning, Prof. Andrew Ng presents, at around the 11-minute mark of https://www.youtube.com/watch?v=ISBGFY-gBug, the learning curve: a plot of cross-validation error and training error versus the size of the training set. I am using the k-fold cross-validation method for hyperparameter tuning and model selection.

In this scenario,

  • Let the variable Xdata denote the entire feature set. It is split so that a training set, DataTrain, is used in the k-fold setup and is further split into a training subset and a validation subset.
  • So, from DataTrain we obtain trainData and testData for each fold of the k-fold setup.
  • Then there is an independent test set, denoted by the variable DataTest.

    When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?

Srishti M

1 Answer


When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?

No.

  • The training error would be the average, over the K folds, of the error on trainData.

  • The test error would be the average, over the K folds, of the error on testData.

Remember that for each fold, the subsets trainData and testData are different.


Source:

A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
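The averaging described above can be sketched with scikit-learn's learning_curve. This is a minimal illustration, not the asker's setup: the synthetic dataset and the LogisticRegression classifier are placeholders, and the accuracy scores are converted to misclassification errors to match the question's terminology.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data; in the question this would be Xdata split out of DataTrain.
X, y = make_classification(n_samples=500, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5,                                  # K = 5 folds
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 training-set sizes
    scoring="accuracy",
)

# train_scores and test_scores have shape (n_sizes, K):
# one column per fold. Average over the K folds for each size,
# then convert accuracy to misclassification error.
train_error = 1 - train_scores.mean(axis=1)
test_error = 1 - test_scores.mean(axis=1)
```

Plotting train_error and test_error against train_sizes reproduces the learning curve from the lecture.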

Xavier Bourret Sicotte
  • Thank you for your answer and the links. Can you please say how to calculate the variance which is often reported for k-fold cross-validation? Is it a scalar value: the variance of the misclassification errors on the `testData` test fold subsets? – Srishti M Jul 21 '18 at 17:33
  • You've got to be careful with what you mean by variance... there are raging debates on this site about the theory behind variance for k-fold cross-validation. If you want to reproduce the standard-deviation fill-between plots as seen on the sklearn website in the link, then you compute the standard deviation of the K training errors (i.e. of each fold). But this isn't really the variance of the CV estimator; it's the variance across the K folds. See here: https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation/357749#357749 – Xavier Bourret Sicotte Jul 21 '18 at 17:44
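The spread across folds mentioned in the comment above can be computed directly from the per-fold scores. A minimal sketch, again with placeholder data and classifier rather than the asker's actual model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model.
X, y = make_classification(n_samples=300, random_state=0)

# One accuracy score per fold (K = 5).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

errors = 1 - scores        # misclassification error on each fold's testData
mean_error = errors.mean()  # the usual reported CV error
fold_std = errors.std()     # spread across the K folds (NOT the variance
                            # of the CV estimator itself, per the comment above)
```

fold_std is the scalar used for the shaded bands in sklearn's learning-curve plots.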