
I'm missing some very basic distinction between cross-validation used for parameter tuning and cross-validation used for estimating the performance (RMSE) of my algorithms.

I have two functions: one performs grid search and the other calculates cross-validated RMSE.

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split

def grid_search(clf, param_grid, x_train, y_train, kf):
    # tune hyperparameters with cross-validated grid search
    grid_model = GridSearchCV(estimator=clf,
                              param_grid=param_grid,
                              cv=kf, verbose=2)
    grid_model.fit(x_train, y_train)
    return grid_model

def rmse_cv(clf, x_train, y_train, kf):
    # cross-validated RMSE of the given (already parameterised) model
    rmses_cross = np.sqrt(-cross_val_score(clf, x_train, y_train,
                                           scoring="neg_mean_squared_error", cv=kf))
    return rmses_cross

The functions are called this way:

X_train, X_test, y_train, y_test = train_test_split(dataset, Y, test_size=0.2, random_state=26)
kf = KFold(10, shuffle=True, random_state=26)

grid_search(clf, param_grid, X_train, y_train, kf)
# adjust the parameters of the regressor
rmses_cross = rmse_cv(clf, X_train, y_train, kf)

As you can see, I use the same KFold for the parameter tuning and exactly the same KFold splits for the calculation of the cross-validated RMSE.

Based on the calculated cross-validated RMSEs I choose which algorithm performs better. BUT the RMSEs are calculated on exactly the same folds on which the hyperparameter tuning was performed.

Is it incorrect to do so? I feel that during tuning the model learns from the hold-out folds, so it would be incorrect to reuse them when calculating the RMSEs. Should I choose a different KFold for the RMSE calculation?

EDIT:

Why do these two snippets produce different results? I thought cross_val_score refits the given model on each fold, so applying cross_val_score to grid_model or to the parameterised model should give the same result.

kf = KFold(10, shuffle = True, random_state = 26)

First:

grid_model = grid_search(clf, param_grid, X_train, y_train, kf)   # already fitted inside grid_search
clf = SVR(kernel='rbf', C=grid_model.best_params_['C'])           # SVR from sklearn.svm, refit with the tuned C
rmses_cross = np.sqrt(-cross_val_score(clf, X_train, y_train,
                      scoring="neg_mean_squared_error", cv=kf))

Second:

grid_model = grid_search(clf, param_grid, X_train, y_train, kf)   # already fitted inside grid_search
rmses_cross = np.sqrt(-cross_val_score(grid_model, X_train, y_train,
                      scoring="neg_mean_squared_error", cv=kf))
Alina
  • Just a reminder that **R**oot**MSE** is [**subadditive**](https://math.stackexchange.com/questions/1588776/subadditivity-of-square-root-function) and should only be calculated at the very end -- and based on *"all"* **S**quared **E**rrors. – Jim Feb 14 '18 at 17:05
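
For illustration only, a tiny numeric check of the point in the comment above, with made-up squared errors for two folds: averaging per-fold RMSEs is not the same as pooling all squared errors first.

import numpy as np

se_fold1 = np.array([1.0, 1.0])   # made-up squared errors from fold 1
se_fold2 = np.array([9.0, 9.0])   # made-up squared errors from fold 2

mean_of_fold_rmses = np.mean([np.sqrt(se_fold1.mean()), np.sqrt(se_fold2.mean())])  # 2.0
pooled_rmse = np.sqrt(np.concatenate([se_fold1, se_fold2]).mean())                  # ~2.236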

1 Answer


> The RMSEs are calculated on exactly the same folds on which the hyperparameter tuning was performed.
> Is it incorrect to do so?

Yes.

> I feel that during tuning the model learns from the hold-out folds, so it would be incorrect to reuse them when calculating the RMSEs. Should I choose a different KFold for the RMSE calculation?

Yes.


What you need to do is called nested cross validation.

I recommend treating the hyperparameter tuning as part of the model training (that's one particular point of view on what nested cross-validation, or a train/optimize/validation split (aka train/validate/test, depending on your field), does).

Briefly, you have 3 functions:

  • a bare-bones (low-level) training function: clf(training_data, hyperparameters)
  • a tuned-model (high-level) training function that internally does the hyperparameter fitting: grid_search(training_data)
  • a testing function: rmse_cv

Now, as you want to measure the performance of the ready-to-use tuned model, you call rmse_cv on the tuned model training function: rmse_cv (grid_search, dataset)
(regardless of whether or not grid_search makes internal use of rmse_cv for tuning purposes as well).
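
In scikit-learn terms, a minimal sketch of this nested setup could look like the following (the SVR regressor, the param_grid values and the second random_state are assumptions for illustration; the inner KFold is used only for tuning, the outer one only for estimating the RMSE of the whole tune-and-fit procedure):

import numpy as np
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

param_grid = {'C': [0.1, 1, 10, 100]}                  # hypothetical grid
inner_kf = KFold(10, shuffle=True, random_state=26)    # tuning folds
outer_kf = KFold(10, shuffle=True, random_state=11)    # performance-estimation folds

# the "tuned model training function": the grid search is wrapped inside the estimator
grid_model = GridSearchCV(SVR(kernel='rbf'), param_grid, cv=inner_kf)

# the "testing function": the outer CV clones and refits the whole GridSearchCV on
# each outer training fold, so the outer test folds never influence the chosen C
rmses_cross = np.sqrt(-cross_val_score(grid_model, X_train, y_train,
                                       scoring="neg_mean_squared_error", cv=outer_kf))

Here X_train and y_train are the ones from your question; giving the outer splitter its own random_state just makes explicit that the evaluation folds are not the tuning folds.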

See also here.

cbeleites unhappy with SX
  • I followed the example [here](http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html). They use the same KFold for the inner and outer loop, but exactly as you mentioned, they pass the grid_search classifier into the rmse_cv function. What is the difference between rmse_cv(grid_search, dataset) and taking the best_params_ from grid_search, updating the classifier, and doing rmse_cv(clf, dataset)? (I updated my question) – Alina Feb 12 '18 at 09:34