I am a little bit confused by the grid search interface in scikit-learn. In examples I have found snippets like this:

# 'score' is a metric name such as 'precision' or 'recall' from the surrounding example
clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                   scoring='%s_weighted' % score)
clf.fit(X_train, y_train)

I imagine that when calling fit the exhaustive search happens, and then the estimator is fitted with the best parameters that were found.

My question is: after I call fit, can I go on and call predict, or is my estimator considered overfit in this case? Should I create another estimator using the best parameters and then perform cross-validation to see what it actually scores?

LetsPlayYahtzee

2 Answers

Looking at the docs, the fit description says "Run fit on the estimator with randomly drawn parameters"

Actually, the docstring of sklearn.grid_search.GridSearchCV.fit() is "Run fit with all sets of parameters".

My second question is, after I call fit can I go on and call predict

Yes.
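
A minimal sketch of this, assuming toy data and an illustrative parameter grid (and using the newer sklearn.model_selection import path rather than the deprecated sklearn.grid_search): with the default refit=True, GridSearchCV refits the best estimator on the whole training set at the end of the search, so the fitted search object can be used for prediction directly.

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Toy data and an illustrative parameter grid (assumptions, not from the post)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
tuned_parameters = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}

clf = GridSearchCV(SVC(), tuned_parameters, cv=5, scoring='f1_weighted')
clf.fit(X_train, y_train)

# With refit=True (the default), clf has been refitted with the best
# parameters, so predict can be called on held-out data right away.
y_pred = clf.predict(X_test)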

Franck Dernoncourt
  • For the first part, yes, that was my mistake! For the second question, wouldn't it be more representative if I recreate the estimator with the best params and then do a cross_val_predict? When I do the first I tend to get slightly better results, whereas when I do the second the results I get match the grid search best-param score. Also, looking at the answer [here](http://stats.stackexchange.com/questions/11602/training-with-the-full-dataset-after-cross-validation?rq=1) I get the feeling that calling `.predict(X)` after I have called `.fit(X)` will produce misleading results – LetsPlayYahtzee Aug 26 '16 at 22:37
  • I think that GridSearchCV performs CV to obtain the scores but trains on the whole dataset. So although the best params indicate the estimator with the better generalization ability, using predict on the same data will give a slightly inflated result because the estimator has previously seen the data. To actually see the generalization ability of your estimator, I think it's better to perform a k-fold cross-validation (fit/predict) with a newly created classifier and take the average, as in the sketch after these comments – LetsPlayYahtzee Aug 26 '16 at 22:44
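
A minimal sketch of what the comment describes, assuming the fitted grid search object clf from the snippets above: rebuild a fresh, unfitted estimator with the winning parameters and cross-validate it, instead of predicting on data the refitted estimator has already seen.

from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Fresh, unfitted estimator configured with the best parameters found
fresh_clf = SVC(**clf.best_params_)

# k-fold cross-validation (fit/predict on each fold) gives a less
# optimistic estimate than predicting on already-seen training data
scores = cross_val_score(fresh_clf, X_train, y_train, cv=5,
                         scoring='f1_weighted')
print(scores.mean())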

You should do the following: (i) get the best estimator from the grid search (which you correctly ran using only training data); (ii) train the best estimator on your training data and test it on your test data:

clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                   scoring='%s_weighted' % score)
clf.fit(X_train, y_train)
model = clf.best_estimator_    # estimator configured with the best found parameters
model.fit(X_train, y_train)    # refit on the full training set
y_pred = model.predict(X_test)

Another reasonable option would be to take the best estimator from the grid search and cross-validate it on the entire dataset, as sketched below.
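
A minimal sketch of that option, assuming X and y hold the entire dataset and clf is the fitted grid search from above; clone gives an unfitted copy of the best estimator so the cross-validation does not reuse already-fitted state.

from sklearn.base import clone
from sklearn.model_selection import cross_val_score

best = clone(clf.best_estimator_)   # unfitted copy with the best parameters
scores = cross_val_score(best, X, y, cv=5, scoring='f1_weighted')
print(scores.mean(), scores.std())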