I want to do support vector regression (SVR) using repeated k-fold cross-validation on a large dataset of 30k points. Because I need to run many of these regressions, I want to downsample the data to 1000 points first to make each regression faster.
Now I am not sure about the best way to handle the repeats:
- Should I select one sub-sample of 1000 points and then create new k-folds for each of the repeats?
- Or should I create a new sub-sample of 1000 points for each repeat and then run a single k-fold CV on it? (The sketch below contrasts the two options.)
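To make the two options concrete, here is a minimal sketch in the same pseudo-code style as the procedure further down (GetSubSample and CreateCvFolds are hypothetical helpers):

// Option A: one fixed sub-sample, new folds in each repeat
var subSample = GetSubSample(wholeSample, mode: random, size: 1000)
foreach (var repeat in 0..repeats)
{
    var cvFoldIndices = CreateCvFolds(subSample, kFold)
    // ... one k-fold CV pass over subSample ...
}

// Option B: a fresh sub-sample (and fresh folds) in each repeat
foreach (var repeat in 0..repeats)
{
    var subSample = GetSubSample(wholeSample, mode: random, size: 1000)
    var cvFoldIndices = CreateCvFolds(subSample, kFold)
    // ... one k-fold CV pass over subSample ...
}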
[EDIT] Based on cbeleites' answer, I want to be a bit more specific about what I want to do:
I think what confuses me might be the distinction between approaches for parameter tuning, model validation, and model selection for prediction. Isn't it the case that you frequently want to do all three, as in my situation? I first want to tune the model parameters (cost and gamma, since I am using an epsilon-SVM), then use the best parameters from the tuning to fit the final model, which is then used for predicting new values.
I am not really a statistician, but based on what I have read (and understood) about best practices, I implemented the following procedure. As I am coding in C#, I wrote it as C#-flavored pseudo-code that I hope everyone can understand:
// want to do repeated k-fold CV with k = 5 and 3 repeats
var kFold = 5
var repeats = 3

// 1. start with the whole sample of 30k points
var wholeSample = GetWholeSample()

// 2. create a tuning grid of model parameters (here cost and gamma for an SVM with radial kernel)
var paramGrid = CreateParamGrid(params: "cost, gamma", tuneLength: 14)

// to save metrics (RMSE) per param combination for each of the repeats
var modelMetrics = map(modelParams => repeatRMSEs = array[repeats])

foreach (var repeat in 0..repeats)
{
    // 3. select a sub-sample of 1000 points (random selection or structured sub-sample)
    var subSample = GetSubSample(wholeSample, mode: random, size: 1000)

    // 4. select random indices for each of the k folds
    var cvFoldIndices = CreateCvFolds(subSample, kFold)

    foreach (var params in paramGrid)
    {
        // to track the RMSE of each CV iteration
        var cvRMSEs = array[kFold]

        foreach (var holdOutFold in 0..kFold)
        {
            var holdOutIndices = cvFoldIndices[holdOutFold]
            var holdOutData = subSample[holdOutIndices]
            // negative indexing: all points except the hold-out fold
            var trainData = subSample[-holdOutIndices]

            // 5. fit model on train data using the current parameter combination
            var modelFit = epsilonSVM.Fit(data: trainData, params: params)

            // 6. predict values for the hold-out data
            var predictedTargets = modelFit.Predict(data: holdOutData.X)

            // 7. assess the prediction metric for the hold-out data
            var predictionRMSE = EvaluatePrediction(metric: "rmse", data: holdOutData.Y, prediction: predictedTargets)
            cvRMSEs[holdOutFold] = predictionRMSE
        }

        // 8. save the average RMSE for this param combination and repeat
        modelMetrics[params].repeatRMSEs[repeat] = average(cvRMSEs)
    }
}

// 9. select the best model params: minimum RMSE averaged over the repeats
var bestMetric = minBy(modelMetrics, m => average(m.repeatRMSEs))
var bestParams = bestMetric.modelParams

// 10. train on the whole sample of 30k points
var finalModelFit = epsilonSVM.Fit(data: wholeSample, params: bestParams)

// 11. use finalModelFit to predict new samples
var newData = GetNewData()
var newPredictions = finalModelFit.Predict(data: newData.X)
So my questions are (I know I should only ask one specific question, but these are strongly interrelated):
- Is this approach valid in general?
- Should I create a new 1000-point sub-sample in each repeat (step 3)?
- Should I fit the final model on all 30k points (step 10)?
- How would I then estimate the performance of my final model (the sketch below shows one idea I had)?
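For the last question, one option I considered (I am not sure it is correct, hence the question) is to hold back a test set before any of the tuning above; again in the same pseudo-code style, with SplitRandomly as a hypothetical helper:

// hold back a test set that is never touched during tuning
var (devSample, testSample) = SplitRandomly(wholeSample, testFraction: 0.2)

// ... run steps 1-9 above on devSample instead of wholeSample to get bestParams ...

// fit on all development data, then estimate performance once on the held-back test set
var finalModelFit = epsilonSVM.Fit(data: devSample, params: bestParams)
var testPredictions = finalModelFit.Predict(data: testSample.X)
var testRMSE = EvaluatePrediction(metric: "rmse", data: testSample.Y, prediction: testPredictions)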