
I am using the Kaggle Scikit data to learn R.

I am using the R e1071 SVM function to predict classes.

When I use:

svm(train, trainLabels, scale = TRUE, type = NULL, kernel = "polynomial")
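(The question doesn't show how `pred` was obtained; presumably something like the following, predicting back on the training sample:)

    # assumption: the fitted model is stored and then used to predict
    # on the same training sample that produced the table below
    model <- svm(train, trainLabels, scale = TRUE, type = NULL, kernel = "polynomial")
    pred  <- predict(model, train)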

I obtain the following confusion matrix on a sample of the Train data:

> table(pred, trainLabels)
    trainLabels
pred   0   1
   0 478   8
   1  12 502

which I interpret as 98% accuracy, since the error rate is (8 + 12) / (478 + 8 + 12 + 502) = 2%.
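Equivalently, in R, the accuracy can be read off the diagonal of that confusion matrix:

    tab <- table(pred, trainLabels)
    sum(diag(tab)) / sum(tab)  # (478 + 502) / 1000 = 0.98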

However, when I use the same prediction model on the Test data, Kaggle returns a score of only 0.82, also based on classification accuracy.

Can you explain why I can get such a different accuracy level?

Timothée HENRY
  • You (almost) always will do a good deal better on the training data than the test data, since you optimize the fit on the training data. – Glen_b Oct 10 '13 at 08:52
  • Yes, that makes sense. I guess I did not separate the Train data into two sets. Thank you. – Timothée HENRY Oct 10 '13 at 09:01
  • I would start by looking at plots of learning curves, described here: https://www.youtube.com/watch?v=g4XluwGYPaA – John Yetter Jan 27 '18 at 15:27
  • Possible duplicate of [What exactly is overfitting?](https://stats.stackexchange.com/questions/281449/what-exactly-is-overfitting) or https://stats.stackexchange.com/questions/304613/definition-of-overfitting – Sycorax Aug 30 '19 at 11:19
  • I don't think it's really a duplicate. Overfitting is a possible explanation, but is it the only one? – Peter Flom Aug 31 '19 at 10:44
  • @PeterFlom There's any number of possible explanations, but when I hear hoofbeats I think "horse," not "zebra." In the absence of any differentiating detail about how this situation arose, overfitting seems like the most plausible explanation. On the other hand, perhaps the scope of possible explanations is so wide as to make the question *too broad*, or the lack of detail makes the question *unclear*. – Sycorax Aug 31 '19 at 19:51

2 Answers


You may have an overfitting problem, where performance on the training data is much better than on the test data. In other words, the model is too specific to your training data, even capturing some of the noise in it, and fails to generalize to new data.

To fix it, use a simpler model or more regularization (e.g., reduce the polynomial degree or lower the cost parameter $C$ in the SVM).
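As a sketch of what such tuning could look like with e1071's `tune.svm` (the grid values here are illustrative, and `trainLabels` is assumed to be a factor):

    library(e1071)

    # grid-search the polynomial degree and the cost C, using tune.svm's
    # default 10-fold cross-validation on the training data
    tuned <- tune.svm(train, factor(trainLabels),
                      kernel = "polynomial",
                      degree = 2:4,        # smaller degree = simpler model
                      cost   = 10^(-1:2))  # smaller C = stronger regularization

    tuned$best.parameters   # selected degree and cost
    tuned$best.performance  # cross-validated error of the best combination
    model <- tuned$best.model

The cross-validated error of the best model should track the Kaggle Test score much more closely than the resubstitution accuracy computed in the question.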

Related discussions:

What is the influence of C in SVMs with linear kernel?

How to know if a learning curve from SVM model suffers from bias or variance?

Haitao Du

You should not expect training accuracy to match test accuracy unless you properly validate your model. Cross-validate your model before predicting on the test data, and read about the bias/variance tradeoff and cross-validation in machine learning.
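For example, e1071's `svm` can report a k-fold cross-validated accuracy directly through its `cross` argument; a minimal sketch with the question's data (assuming `trainLabels` is a factor):

    library(e1071)

    # fit the same polynomial SVM, but also run 10-fold cross-validation
    fit <- svm(train, factor(trainLabels), scale = TRUE,
               kernel = "polynomial", cross = 10)

    fit$accuracies    # accuracy on each held-out fold
    fit$tot.accuracy  # overall cross-validated accuracy, a more honest
                      # estimate of what to expect on the Test data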

nithish08