I have read a lot of discussions and articles and I am still a bit confused about how to use an SVM correctly with cross-validation.
Consider a dataset of 50 samples described by 10 features.
First, I split the dataset into two parts: a training set (70%) and a "validation" set (30%).
Then I have to select the best combination of hyperparameters (C, gamma) for my RBF SVM, so I run 5-fold cross-validation on the training set and use a performance metric (AUC, for example) to pick the best pair.
Finally, I refit the SVM with the best hyperparameters and measure the performance metric on the "validation" set (a code sketch of the whole procedure is at the end of the post). My questions are:
1. Is the 70/30 ratio appropriate for splitting the dataset?
2. Is it useful to also run cross-validation on the "validation" set?
3. Is it better to loop over this whole procedure so that the training and validation sets get randomly different compositions (see the second sketch below)?
4. If 3 is better, how many loops should I run, and which statistics should I report on the performance metric?
5. Do we agree that running cross-validation on the full dataset, tuning and evaluating on the same data, is the worst thing to do?
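For concreteness, here is a minimal sketch of the procedure described above, using scikit-learn; the toy dataset and the C/gamma grid values are placeholders, not my real setup:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Toy stand-in for the 50-sample, 10-feature dataset
X, y = make_classification(n_samples=50, n_features=10, random_state=0)

# Step 1: 70/30 split, stratified so both parts keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Step 2: 5-fold CV on the training set only, with AUC as the selection metric
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
search.fit(X_train, y_train)
print("best (C, gamma):", search.best_params_)

# Step 3: a single evaluation of the refit best model on the held-out 30%
auc = roc_auc_score(y_test, search.decision_function(X_test))
print("held-out AUC:", auc)
```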
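And this is the kind of loop I have in mind for question 3, reusing X, y, and param_grid from the sketch above (the 30 repeats are an arbitrary choice on my part):

```python
# Repeat the whole split / tune / test procedure over random partitions
# and summarise the distribution of held-out AUCs instead of a single number
aucs = []
for seed in range(30):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="roc_auc", cv=5)
    search.fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, search.decision_function(X_te)))

print(f"AUC: mean={np.mean(aucs):.3f}, std={np.std(aucs):.3f}")
```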