I want to do support vector regression (SVR) using repeated k-fold cross-validation on a large dataset of 30k points. Because I need to run many of these regressions, I want to downsample the data to 1000 points first to make each regression faster.
Now I am not sure about the best way to handle the repeats:
- Should I select one sub-sample of 1000 points and then create new k-folds for each of the repeats?
- Or should I create a new sub-sample of 1000 points for each repeat and then run a single k-fold CV on it? (The sketch below contrasts the two options.)
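To make the two options concrete, here is a minimal sketch in the same pseudo-code style as the procedure further down (GetSubSample and CreateCvFolds are hypothetical helpers):

// Option A: one fixed sub-sample, new folds in each repeat
var subSample = GetSubSample(wholeSample, mode: random, size: 1000)
foreach (var repeat in 0..repeats)
{
    var cvFoldIndices = CreateCvFolds(subSample, kFold)
    // ... one k-fold CV pass over subSample ...
}

// Option B: a fresh sub-sample (and fresh folds) in each repeat
foreach (var repeat in 0..repeats)
{
    var subSample = GetSubSample(wholeSample, mode: random, size: 1000)
    var cvFoldIndices = CreateCvFolds(subSample, kFold)
    // ... one k-fold CV pass over subSample ...
}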
[EDIT] Based on cbeleites' answer, I want to be a bit more specific about what I want to do:
I think what confuses me might be the distinction between approaches for parameter tuning, model validation, and model selection for prediction. Isn't it the case that you frequently want to do all three, as in my situation? I first want to tune the model parameters (cost and gamma, since I am using an epsilon-SVM), then use the best parameters from the tuning to fit the final model, which is then used for predicting new values.
I am not really a statistician, but based on what I have read (and understood) about best practices, I implemented the following procedure. As I am coding in C#, I wrote it as C#-flavored pseudo-code that I hope everyone can understand:
// want to do repeated k-fold CV with k = 5 and 3 repeats
var kFold = 5
var repeats = 3

// 1. start with the whole sample of 30k points
var wholeSample = GetWholeSample()

// 2. create a tuning grid of model parameters (here cost and gamma for an SVM with radial kernel)
var paramGrid = CreateParamGrid(params: "cost, gamma", tuneLength: 14)

// to save metrics (RMSE) per param combination for each of the repeats
var modelMetrics = map(modelParams => repeatRMSEs = array[repeats])

foreach (var repeat in 0..repeats)
{
    // 3. select a sub-sample of 1000 points (random selection or structured sub-sample)
    var subSample = GetSubSample(wholeSample, mode: random, size: 1000)

    // 4. select random indices for each of the k folds
    var cvFoldIndices = CreateCvFolds(subSample, kFold)

    foreach (var params in paramGrid)
    {
        // to track the RMSE of each CV iteration
        var cvRMSEs = array[kFold]

        foreach (var holdOutFold in 0..kFold)
        {
            var holdOutIndices = cvFoldIndices[holdOutFold]
            var holdOutData = subSample[holdOutIndices]
            // negative indexing: all points except the hold-out fold
            var trainData = subSample[-holdOutIndices]

            // 5. fit model on train data using the current parameter combination
            var modelFit = epsilonSVM.Fit(data: trainData, params: params)

            // 6. predict values for the hold-out data
            var predictedTargets = modelFit.Predict(data: holdOutData.X)

            // 7. assess the prediction metric for the hold-out data
            var predictionRMSE = EvaluatePrediction(metric: "rmse", data: holdOutData.Y, prediction: predictedTargets)
            cvRMSEs[holdOutFold] = predictionRMSE
        }

        // 8. save the average RMSE for this param combination and repeat
        modelMetrics[params].repeatRMSEs[repeat] = average(cvRMSEs)
    }
}

// 9. select the best model params: minimum RMSE averaged over the repeats
var bestMetric = minBy(modelMetrics, m => average(m.repeatRMSEs))
var bestParams = bestMetric.modelParams

// 10. train on the whole sample of 30k points
var finalModelFit = epsilonSVM.Fit(data: wholeSample, params: bestParams)

// 11. use finalModelFit to predict new samples
var newData = GetNewData()
var newPredictions = finalModelFit.Predict(data: newData.X)
So my questions are (I know I should only ask one specific question, but these are strongly interrelated):
- Is this approach valid in general?
- Should I create a new 1000-point sub-sample in each repeat (step 3)?
- Should I fit the final model on all 30k points (step 10)?
- How would I then estimate the performance of my final model (the sketch below shows one idea I had)?
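For the last question, one option I considered (I am not sure it is correct, hence the question) is to hold back a test set before any of the tuning above; again in the same pseudo-code style, with SplitRandomly as a hypothetical helper:

// hold back a test set that is never touched during tuning
var (devSample, testSample) = SplitRandomly(wholeSample, testFraction: 0.2)

// ... run steps 1-9 above on devSample instead of wholeSample to get bestParams ...

// fit on all development data, then estimate performance once on the held-back test set
var finalModelFit = epsilonSVM.Fit(data: devSample, params: bestParams)
var testPredictions = finalModelFit.Predict(data: testSample.X)
var testRMSE = EvaluatePrediction(metric: "rmse", data: testSample.Y, prediction: testPredictions)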