
Recently, I have been very confused about this question. My understanding is that cross-validation for machine learning methods only helps me choose the best combination of hyperparameters, and that I then need to refit the model on the whole training data once I have decided on the best parameters. However, one professor said he doesn't need to do that in R. I use Python, and I use the function "best_estimator_.predict".

My code:

import numpy as np

from sklearn import tree
from sklearn.model_selection import RandomizedSearchCV

criterion = ['entropy', 'gini']

max_features = ['log2', 'sqrt']
max_features.append(None)

max_depth = [int(x) for x in np.linspace(3, 100, num=35)]
max_depth.append(None)

# None is not a valid value for min_samples_split / min_samples_leaf,
# so (unlike max_depth and max_features) no None is appended here.
min_samples_split = [5, 10, 15, 20]

min_samples_leaf = [5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100, 200, 300, 400, 500, 600, 1000]

random_grid = {'criterion': criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

clf = tree.DecisionTreeClassifier()

clf_dt = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                            n_iter=300, scoring='roc_auc', cv=5,
                            verbose=2, random_state=42, n_jobs=-1)

clf_dt.fit(X_train, y_train)

cv_dt_result = clf_dt.cv_results_

Result: the ROC AUC values from the 5-fold cross-validation are: [0.6433645242811283, 0.6554538624410902, 0.6576927805477768, 0.6491496482480705, 0.6350727409329964]
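For completeness, here is a minimal sketch of how per-fold values like these can be pulled out of cv_results_ for the best parameter combination (an illustration, not necessarily the exact code I used):

# Per-fold test scores of the best candidate, stored by RandomizedSearchCV
# under the keys 'split<k>_test_score' in cv_results_.
best_idx = clf_dt.best_index_
fold_scores = [clf_dt.cv_results_['split%d_test_score' % k][best_idx] for k in range(5)]
print(fold_scores)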

Then I used this code to evaluate the test data:

# evaluate is my function to get the model's performance

decision_tree_perform = evaluate(clf_dt.best_estimator_, X_test, Y_test)

Result:

ROC_AUC: 0.605857133223997, Sensitivity: 0.66005291005291, Specificity: 0.5516613563950842, Accuracy: 0.5602472757753563, GMean: 0.6041379037785287
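The evaluate helper itself isn't shown above; purely for illustration, a function reporting the same set of metrics could look roughly like this (a hedged sketch, not my exact code):

import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score

def evaluate(model, X, y):
    # Illustrative stand-in: computes the same metric names as reported above.
    y_pred = model.predict(X)
    tn, fp, fn, tp = confusion_matrix(y, y_pred).ravel()
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        'ROC_AUC': roc_auc_score(y, model.predict_proba(X)[:, 1]),
        'Sensitivity': sensitivity,
        'Specificity': specificity,
        'Accuracy': accuracy_score(y, y_pred),
        'GMean': np.sqrt(sensitivity * specificity),
    }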

You can see that the ROC AUC on the test data is worse than any of the results from the 5-fold cross-validation. So I have some questions here:

  1. After I get the results of the cross-validation, such as the best parameters, should I re-train the model on the whole training data again with the best parameters? Or is best_estimator_.predict enough here?

  2. I also tried other methods such as logistic regression, SVM, and others, and the result on the test data is always worse than any result of the cross-validation. I think I just have some overfitting problem here, but the professor said it doesn't make sense that I always get a worse result on the test data compared with the cross-validation. I am super frustrated about it.

Hope someone could help me out. My research is actually focused on optimization models, but I do need good predictions as an input to my model. Thanks!

2 Answers


should I re-train the model based on the whole training data again with the best parameters?

Training an ML model is an iterative procedure: choose a model, estimate under- and overfitting, and calibrate the hyperparameters accordingly. Once you have done enough iterations and are satisfied with the performance, you should retrain your model on all of the labeled data (not only on the training set). Splitting the data into train/test sets is there to evaluate under- and overfitting and to help you choose the hyperparameters. Once this is achieved, it makes sense to get maximal performance before using your model in real applications.
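A minimal sketch of that final retraining step, assuming the fitted clf_dt search object and the X_train/X_test, y_train/Y_test variables from the question (names taken from the question; the concatenation assumes NumPy arrays):

import numpy as np
from sklearn import tree

# Hyperparameters selected by the randomized search on the training set.
best_params = clf_dt.best_params_

# Retrain on all labeled data (train + test) before real-world use.
X_all = np.concatenate([X_train, X_test])
y_all = np.concatenate([y_train, Y_test])

final_model = tree.DecisionTreeClassifier(**best_params)
final_model.fit(X_all, y_all)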

The result on the test data is always worse than any result of the cross-validation.

As far as I understand, you evaluated the prediction error:

  • on a cross-validation set that you used to calibrate your hyper-parameters and
  • on a test set, which you didn't use to calibrate your hyper-parameters.

If this is the case, then it makes perfect sense. You explicitly calibrated your model to work well on the cross-validation set. Therefore, one can expect some variation between the test error and the cross-validation error. However, if the difference between them is large, then yes, your model might be suffering from overfitting; in other words, you might have overtuned your hyperparameters to the cross-validation set.

  • Thanks for your help. Based on my understanding, I can use the best model, i.e. the model with the best hyperparameters from the cross-validation, to retrain the model. One thing you said is "not only on the training set": does that mean the retraining should not be based on the whole training set? Should I re-generate training and test sets that are different from the ones used for cross-validation? Or does it mean the retraining should happen on the same whole training data set as was used for cross-validation, but the evaluation should be based on the test set? Thanks in advance. – Joan Jul 13 '20 at 13:26

The standard machine learning procedure goes like this:

  1. Split the data into training and test sets.
  2. Perform CV on the training set to find the best-performing hyperparameters.
  3. Determine the generalization error on the test set.
  4. Once you know your generalization error, take your entire dataset and perform CV again to find the best hyperparameters; ignore the error you find here, since your error estimate is the one from the previous step.
  5. Build your final model on the entire dataset using those hyperparameters.
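A hedged sketch of those steps with scikit-learn, assuming a feature matrix X, labels y (placeholder names) and the random_grid from the question:

import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import roc_auc_score

# 1. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)

# 2. CV on the training set to pick hyperparameters.
search = RandomizedSearchCV(tree.DecisionTreeClassifier(), random_grid,
                            n_iter=300, scoring='roc_auc', cv=5,
                            random_state=42, n_jobs=-1)
search.fit(X_train, y_train)

# 3. Generalization error on the held-out test set
#    (with the default refit=True, best_estimator_ is already refit on the full training set).
test_auc = roc_auc_score(y_test, search.best_estimator_.predict_proba(X_test)[:, 1])
print('estimated generalization ROC AUC:', test_auc)

# 4. Re-run the search on the entire dataset and keep only the hyperparameters;
#    the error estimate remains the one from step 3.
final_search = RandomizedSearchCV(tree.DecisionTreeClassifier(), random_grid,
                                  n_iter=300, scoring='roc_auc', cv=5,
                                  random_state=42, n_jobs=-1)
final_search.fit(X, y)

# 5. Final model trained on all data with those hyperparameters.
final_model = tree.DecisionTreeClassifier(**final_search.best_params_)
final_model.fit(X, y)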

Even better is something called nested cross-validation; you can look that up if you wish.
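For completeness, a minimal sketch of nested cross-validation with scikit-learn, again assuming X, y, and random_grid from above: the inner search tunes the hyperparameters, while the outer loop estimates the generalization error.

from sklearn import tree
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Inner loop: hyperparameter search (smaller n_iter to keep the nested run affordable).
inner_search = RandomizedSearchCV(tree.DecisionTreeClassifier(), random_grid,
                                  n_iter=50, scoring='roc_auc', cv=5,
                                  random_state=42, n_jobs=-1)

# Outer loop: each outer fold gets its own tuned model, giving an
# approximately unbiased estimate of the generalization ROC AUC.
outer_scores = cross_val_score(inner_search, X, y, scoring='roc_auc', cv=5)
print('nested CV ROC AUC: %.3f +/- %.3f' % (outer_scores.mean(), outer_scores.std()))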
