I am new to sklearn and I am trying to learn how to use cross-validation to choose the best SVM model. I found this example: How to split the dataset for cross validation, learning curve, and final evaluation? and I tried to understand how it works. Here are some lines that I am not sure I have understood.
from sklearn.model_selection import learning_curve  # sklearn.learning_curve is deprecated
title = r'Learning Curves (SVM, linear kernel, $\gamma=%.6f$)' % classifier.best_estimator_.gamma
estimator = SVC(kernel='linear', gamma=classifier.best_estimator_.gamma)
plot_learning_curve(estimator, title, X_train, y_train, cv=cv)
plt.show()
1) What is the estimator object here? Is it a clone of the best model returned by the cross-validation? I don't think so.
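If I understand the linked example correctly, a minimal sketch of what happens would be the following (the toy data and the GridSearchCV setup are my assumptions, not from the example): estimator is a brand-new, unfitted SVC that merely copies one hyperparameter from the winning model.

```python
# Assumed setup: GridSearchCV stores the fitted winner in best_estimator_;
# the snippet in the question builds a NEW, unfitted SVC that only copies
# the tuned gamma value from it -- it is not a clone of the fitted model.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)
classifier = GridSearchCV(SVC(), {'gamma': [0.001, 0.01, 0.1]}).fit(X, y)

# Fresh estimator with the tuned gamma -- no fit has happened yet.
estimator = SVC(kernel='linear', gamma=classifier.best_estimator_.gamma)
print(hasattr(estimator, 'support_'))   # False: it has never been fitted
```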
2) Will this function plot_learning_curve apply the cross-validation selection again? I think yes, because it takes a cross-validation iterator.
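To check my understanding, here is a small sketch with made-up data (the ShuffleSplit splitter and sizes are my assumptions): learning_curve, which plot_learning_curve calls internally, re-fits the estimator from scratch for every combination of training size and CV fold defined by the cv splitter you pass in.

```python
# Assumed toy setup: learning_curve performs one fit+score per
# (train_size, cv_fold) pair using the supplied cv splitter.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, learning_curve
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

train_sizes, train_scores, test_scores = learning_curve(
    SVC(kernel='linear'), X, y, cv=cv,
    train_sizes=np.linspace(0.1, 1.0, 5))

# Shape is (n_train_sizes, n_cv_folds): each entry is a separate fit.
print(train_scores.shape)   # (5, 5)
```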
classifier.score(X_test, y_test)
3) Which model produces this score? Is it the best model selected in section 5) of the previous link?
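My guess, sketched on made-up data (the train/test split and parameter grid are my assumptions): after GridSearchCV.fit, score() delegates to best_estimator_, which was refit on the whole training set because refit=True by default.

```python
# Assumed setup: with the default scoring, GridSearchCV.score uses
# best_estimator_.score, i.e. the winning model refit on all of X_train.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

classifier = GridSearchCV(SVC(), {'C': [0.1, 1, 10]}).fit(X_train, y_train)

# Both calls score the same refit best model:
print(classifier.score(X_test, y_test) ==
      classifier.best_estimator_.score(X_test, y_test))   # True
```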
classifier.fit(X,y)
4) What is the purpose of this operation?
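If I understand correctly, this final fit re-trains the chosen configuration on every available sample, since the train/test split was only needed for evaluation. A sketch on made-up data (the dataset and split sizes are my assumptions):

```python
# Assumed toy setup: calling fit again completely replaces the previous
# fit, so the final model has been trained on ALL the data.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=150, random_state=0)
clf = SVC(kernel='linear')

clf.fit(X[:100], y[:100])   # fit used during evaluation (subset only)
clf.fit(X, y)               # final refit on the full dataset

# shape_fit_ records the dimensions of the last training set seen:
print(clf.shape_fit_)   # (150, 20)
```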