
In his Coursera video lecture on basic good practices in machine learning, Prof. Andrew Ng presents, at around the 11-minute mark of https://www.youtube.com/watch?v=ISBGFY-gBug, the learning curve: a plot of cross-validation error and training error versus the size of the training set. I am using the k-fold cross-validation method for hyperparameter tuning and model selection.

In this scenario,

  • Let the variable Xdata denote the entire feature set. It is split so that a training set, DataTrain, is used in the k-fold setup and is further split into a training subset and a validation subset.
  • So, from DataTrain we obtain trainData and testData for each fold of the k-fold setup.
  • Then there is an independent test set, denoted by the variable DataTest.

    When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?

Srishti M

1 Answer


When using the k-fold cross-validation method to plot the learning curve, would the training error be the misclassification error on DataTrain, and the cross-validation error be the misclassification error on the validation subset, testData?

No.

  • The training error would be the average, over the K folds, of the error on trainData.

  • The test error would be the average, over the K folds, of the error on testData.

Remember that for each fold, the subsets trainData and testData are different.


Source:

A cross-validation generator splits the whole dataset k times into training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
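The averaging described above can be sketched with scikit-learn's learning_curve. This is a minimal illustration, not the asker's setup: the synthetic dataset and the LogisticRegression classifier are placeholders, and the accuracy scores are converted to misclassification errors to match the question's terminology.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Placeholder data; in the question this would be Xdata split out of DataTrain.
X, y = make_classification(n_samples=500, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5,                                  # K = 5 folds
    train_sizes=np.linspace(0.1, 1.0, 5),  # 5 training-set sizes
    scoring="accuracy",
)

# train_scores and test_scores have shape (n_sizes, K):
# one column per fold. Average over the K folds for each size,
# then convert accuracy to misclassification error.
train_error = 1 - train_scores.mean(axis=1)
test_error = 1 - test_scores.mean(axis=1)
```

Plotting train_error and test_error against train_sizes reproduces the learning curve from the lecture.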

Xavier Bourret Sicotte
  • Thank you for your answer and the links. Can you please say how to calculate the variance which is often reported for k-fold cross-validation? Is it a scalar value: the variance of the misclassification errors on the `testData` test fold subsets? – Srishti M Jul 21 '18 at 17:33
  • You've got to be careful with what you mean by variance... there are raging debates on this site about the theory behind variance for k-fold cross-validation. If you want to reproduce the standard-deviation fill-between plots as seen on the sklearn website in the link, then you compute the standard deviation of the K training errors (i.e. of each fold). But this isn't really the variance of the CV estimator; it's the variance across the K folds. See here: https://stats.stackexchange.com/questions/61783/bias-and-variance-in-leave-one-out-vs-k-fold-cross-validation/357749#357749 – Xavier Bourret Sicotte Jul 21 '18 at 17:44
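The spread across folds mentioned in the comment above can be computed directly from the per-fold scores. A minimal sketch, again with placeholder data and classifier rather than the asker's actual model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder data and model.
X, y = make_classification(n_samples=300, random_state=0)

# One accuracy score per fold (K = 5).
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

errors = 1 - scores        # misclassification error on each fold's testData
mean_error = errors.mean()  # the usual reported CV error
fold_std = errors.std()     # spread across the K folds (NOT the variance
                            # of the CV estimator itself, per the comment above)
```

fold_std is the scalar used for the shaded bands in sklearn's learning-curve plots.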