
I am using the Kaggle Scikit data to learn R.

I am using the R e1071 SVM function to predict classes.

When I use:

svm(train, trainLabels, scale = TRUE, type = NULL, kernel = "polynomial")
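(The question doesn't show how `pred` was obtained; presumably something like the following, predicting back on the training sample:)

    # assumption: the fitted model is stored and then used to predict
    # on the same training sample that produced the table below
    model <- svm(train, trainLabels, scale = TRUE, type = NULL, kernel = "polynomial")
    pred  <- predict(model, train)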

I obtain the following confusion matrix on a sample of the Train data:

> table(pred, trainLabels)
    trainLabels
pred   0   1
   0 478   8
   1  12 502

which I interpret as 98% accuracy, since the error rate is (8 + 12) / (478 + 8 + 12 + 502) = 2%.
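Equivalently, in R, the accuracy can be read off the diagonal of that confusion matrix:

    tab <- table(pred, trainLabels)
    sum(diag(tab)) / sum(tab)  # (478 + 502) / 1000 = 0.98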

However, when I use the same prediction model on the Test data, Kaggle returns a score of only 0.82, also based on classification accuracy.

Can you explain why I can get such a different accuracy level?

Timothée HENRY
  • You (almost) always will do a good deal better on the training data than the test data, since you optimize the fit on the training data. – Glen_b Oct 10 '13 at 08:52
  • Yes, that makes sense. I guess I did not separate the Train data into two sets. Thank you. – Timothée HENRY Oct 10 '13 at 09:01
  • I would start by looking at plots of learning curves, described here: https://www.youtube.com/watch?v=g4XluwGYPaA – John Yetter Jan 27 '18 at 15:27
  • Possible duplicate of [What exactly is overfitting?](https://stats.stackexchange.com/questions/281449/what-exactly-is-overfitting) or https://stats.stackexchange.com/questions/304613/definition-of-overfitting – Sycorax Aug 30 '19 at 11:19
  • I don't think it's really a duplicate. Overfitting is a possible explanation, but is it the only one? – Peter Flom Aug 31 '19 at 10:44
  • @PeterFlom There's any number of possible explanations, but when I hear hoofbeats I think "horse," not "zebra." In the absence of any differentiating detail about how this situation arose, overfitting seems like the most plausible explanation. On the other hand, perhaps the scope of possible explanations is so wide as to make the question *too broad*, or the lack of detail makes the question *unclear*. – Sycorax Aug 31 '19 at 19:51

2 Answers


You may have an overfitting problem, where performance on the training data is much better than on the test data. In other words, the model is too specific to your training data, even capturing some of the noise in it, and fails to generalize to new data.

To fix it, use a simpler model or more regularization (e.g., reduce the polynomial degree or lower the cost parameter $C$ in the SVM).
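As a sketch of what such tuning could look like with e1071's `tune.svm` (the grid values here are illustrative, and `trainLabels` is assumed to be a factor):

    library(e1071)

    # grid-search the polynomial degree and the cost C, using tune.svm's
    # default 10-fold cross-validation on the training data
    tuned <- tune.svm(train, factor(trainLabels),
                      kernel = "polynomial",
                      degree = 2:4,        # smaller degree = simpler model
                      cost   = 10^(-1:2))  # smaller C = stronger regularization

    tuned$best.parameters   # selected degree and cost
    tuned$best.performance  # cross-validated error of the best combination
    model <- tuned$best.model

The cross-validated error of the best model should track the Kaggle Test score much more closely than the resubstitution accuracy computed in the question.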

Related discussions:

What is the influence of C in SVMs with linear kernel?

How to know if a learning curve from SVM model suffers from bias or variance?

Haitao Du

You should not expect training accuracy to match test accuracy unless you properly validate your model. Cross-validate your model before predicting on the test data, and read about the bias/variance tradeoff and cross-validation in machine learning.
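For example, e1071's `svm` can report a k-fold cross-validated accuracy directly through its `cross` argument; a minimal sketch with the question's data (assuming `trainLabels` is a factor):

    library(e1071)

    # fit the same polynomial SVM, but also run 10-fold cross-validation
    fit <- svm(train, factor(trainLabels), scale = TRUE,
               kernel = "polynomial", cross = 10)

    fit$accuracies    # accuracy on each held-out fold
    fit$tot.accuracy  # overall cross-validated accuracy, a more honest
                      # estimate of what to expect on the Test data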

nithish08