I'm currently using an SVM (classification) to predict the outcome of a sports match. I split the data into three sets: training, cross-validation, and test. I have 2200 sample points in total. I optimised the regularisation parameter by looping over a range of values and picking the one with the best accuracy on the cross-validation set.
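For reference, here is a minimal sketch of what I mean by that tuning loop. The data here is synthetic (a stand-in for my real match features), and the split ratios and C grid are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real match data (2200 samples)
X, y = make_classification(n_samples=2200, n_features=20, random_state=0)

# 60/20/20 split into training / cross-validation / test sets
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0)

# Sweep the regularisation parameter C, scoring on the CV set
best_C, best_cv_acc = None, -1.0
for C in (0.01, 0.1, 1, 10, 100):
    clf = SVC(C=C, kernel="rbf").fit(X_train, y_train)
    acc = clf.score(X_cv, y_cv)
    if acc > best_cv_acc:
        best_C, best_cv_acc = C, acc

# Final model with the chosen C, evaluated once on the held-out test set
final = SVC(C=best_C, kernel="rbf").fit(X_train, y_train)
test_acc = final.score(X_test, y_test)
```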
I got 72.2% accuracy on the training set, 65.2% on the CV set, and 63.4% on the test set. This looks like overfitting to me, but I'm not sure how to counteract it with an SVM. I could possibly remove features, maybe by selecting the best K features, but does anyone have other ideas?
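If it helps, this is roughly how I'd try the feature-selection idea: a pipeline that scales the features, keeps the K highest-scoring ones, and fits the SVM, with both K and C tuned on the CV set. Again the data is synthetic and the grids are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real match data
X, y = make_classification(n_samples=2200, n_features=30,
                           n_informative=10, random_state=0)
X_train, X_cv, y_train, y_cv = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Jointly tune the number of kept features (K) and regularisation (C)
best_params, best_acc = None, -1.0
for k in (5, 10, 20):
    for C in (0.01, 0.1, 1, 10):
        pipe = Pipeline([
            ("scale", StandardScaler()),          # SVMs are scale-sensitive
            ("select", SelectKBest(f_classif, k=k)),
            ("svm", SVC(C=C, kernel="rbf")),
        ])
        pipe.fit(X_train, y_train)
        acc = pipe.score(X_cv, y_cv)
        if acc > best_acc:
            best_params, best_acc = (k, C), acc
```

Fitting the selector inside the pipeline (rather than on the full data) keeps the CV scores honest, since the feature scoring never sees the validation rows.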