SVM model training set vs test set

Question

I am trying to train an SVM model using the Forest Fires data. I split my data into a training set and a test set. I am fairly new to this type of analysis, and I'm not sure what role the test data plays or even why it's recommended to split the data into training and test sets. How do I use the test data to see how well the trained model fits? The data come from https://archive.ics.uci.edu/ml/datasets/Forest+Fires

In addition, I am using ksvm from library(kernlab) because svm from library(e1071) has not worked for me in the past. Variables day and month are categorical so I treated them as factors using as.factor(day) and as.factor(month) in the ksvm model.

    forestfires = read.csv("forestfires.csv")  # read csv file
    head(forestfires)
    summary(forestfires)

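    # Build training/test samples
    set.seed(0508)
    # Draw 75% of the row indices for training (avoid naming the vector `sample`,
    # which would shadow the sample() function)
    train_idx <- sample(1:nrow(forestfires), floor(0.75*nrow(forestfires)))
    trainfire <- forestfires[train_idx, ]   # 75% of rows, used to fit the model
    testfire <- forestfires[-train_idx, ]   # remaining 25%, held out for testing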

    #Build SVM model
    library(kernlab)

    vmod<-ksvm(log(area+1)~X+Y+as.factor(month)+as.factor(day)+
    FFMC+DMC+DC+ISI+temp+RH+wind+rain, data=trainfire, type="nu-svr")

SVMs are powerful, regularized algorithms. They might fit your training data perfectly, but that does not mean the resulting model carries any useful information. To know whether your model can make predictions on unseen data, you have to test it on data it has never seen before. Keep in mind that `kernlab` includes built-in cross-validation (use the argument `cross = 10`, for example), which trains and tests on different parts of the data, divided into "folds". - Firebug, Jul 16 '16 at 00:48
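A minimal sketch of the built-in cross-validation mentioned in this comment. In `ksvm`, k-fold cross-validation is requested through the `cross` argument, and the resulting error is read back with the `cross()` accessor (for regression this is a mean squared error). The data frame `dat` and its columns `x1`, `x2`, `y` are synthetic stand-ins invented for illustration, not the forest fire data.

```r
library(kernlab)

set.seed(508)

# Synthetic stand-in data (an assumption for illustration only)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 0.5 * dat$x2 + rnorm(n, sd = 0.2)

# Fit a nu-SVR model and ask kernlab for 10-fold cross-validation
fit_cv <- ksvm(y ~ x1 + x2, data = dat, type = "nu-svr", cross = 10)

# Cross-validation error (mean squared error for regression models)
cv_mse <- cross(fit_cv)
cv_rmse <- sqrt(cv_mse)
print(cv_rmse)
```

With the real data, the same `cross = 10` argument could be added to the `ksvm` call from the question.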

score 2 · Accepted Answer · edited May 23 '17 at 12:39

There is a similar question regarding the division of data sets into subsets for machine learning:
What is the difference between test set and validation set?

The training data set is used for the training of your machine learning model (SVM in your case). The algorithm uses the data from the training data set to learn rules for classification/prediction.

The testing data set is used to evaluate your model on data that was not used for training. It checks whether the rules learned from the training data also apply to unseen data, and it does so by computing an error rate. If you have a classification model (categorical classes), a confusion matrix is used to estimate error rates (Simple guide to confusion matrices). For regression, where the target is continuous as in your model, the RMSD (root-mean-square deviation, also called RMSE) is often computed.
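As a rough sketch of how that evaluation could look in R with `kernlab`: fit on the training rows only, predict on the held-out rows, and compare RMSD on both sets. The data frame `dat`, its columns, and the helper `rmse()` are hypothetical names made up for this example; the data are synthetic so the snippet runs without the CSV file.

```r
library(kernlab)

set.seed(508)

# Synthetic stand-in data (an assumption; swap in forestfires in practice)
n <- 200
dat <- data.frame(x1 = runif(n), x2 = runif(n))
dat$y <- sin(2 * pi * dat$x1) + 0.5 * dat$x2 + rnorm(n, sd = 0.2)

# 75% of the rows for training, the remaining 25% for testing
train_idx <- sample(seq_len(n), size = floor(0.75 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Fit on the training set only
fit <- ksvm(y ~ x1 + x2, data = train, type = "nu-svr")

# Hypothetical helper: root-mean-square deviation between actual and predicted
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# predict() returns a one-column matrix, so flatten it to a vector
pred_train <- as.numeric(predict(fit, train))
pred_test <- as.numeric(predict(fit, test))

train_rmse <- rmse(train$y, pred_train)
test_rmse <- rmse(test$y, pred_test)
print(c(train = train_rmse, test = test_rmse))
```

A test RMSD much larger than the training RMSD would suggest overfitting. For the model in the question, the comparison would be made on the `log(area+1)` scale, since that is what the model predicts.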

For some basic rules on how to split your data into training and testing subsets see:
Is there a rule-of-thumb for how to divide a dataset into training and validation sets?
