Recently I have been very confused about this question. My understanding is that cross-validation for machine learning methods is just there to help me choose the best combination of hyperparameters, and that I then need to refit the model on the whole training data once I have decided on the best parameters. However, one professor said he doesn't need to do that in R. I use Python, and I call "best_estimator_.predict".
My code:
import numpy as np
from sklearn import tree
from sklearn.model_selection import RandomizedSearchCV

# Hyperparameter search space
criterion = ['entropy', 'gini']
max_features = ['log2', 'sqrt', None]
max_depth = [int(x) for x in np.linspace(3, 100, num=35)]
max_depth.append(None)  # None: nodes are expanded until pure
min_samples_split = [5, 10, 15, 20]
min_samples_leaf = [5, 10, 15, 20, 25, 30, 35, 40, 50, 60, 100, 200, 300, 400, 500, 600, 1000]

random_grid = {'criterion': criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

clf = tree.DecisionTreeClassifier()
# 5-fold CV over 300 random parameter combinations, scored by ROC AUC
clf_dt = RandomizedSearchCV(estimator=clf, param_distributions=random_grid,
                            n_iter=300, scoring='roc_auc', cv=5,
                            verbose=2, random_state=42, n_jobs=-1)
clf_dt.fit(X_train, y_train)
cv_dt_result = clf_dt.cv_results_
Result: the ROC AUC scores from the 5-fold cross-validation are: [0.6433645242811283, 0.6554538624410902, 0.6576927805477768, 0.6491496482480705, 0.6350727409329964]
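Those per-fold numbers come out of cv_results_; roughly, pulling them out for the winning candidate looks something like this (just a sketch, assuming the fitted clf_dt object above):

# Per-fold test scores of the best parameter combination found by the search
best_idx = clf_dt.best_index_
fold_scores = [clf_dt.cv_results_['split%d_test_score' % i][best_idx] for i in range(5)]
print(fold_scores)           # five ROC AUC values, one per CV fold
print(clf_dt.best_score_)    # their mean, i.e. cv_results_['mean_test_score'][best_idx]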
Then I used this code to evaluate the model on the test data:
# evaluate is my function to get the model's performance
decision_tree_perform = evaluate(clf_dt.best_estimator_, X_test, Y_test)
Result:
ROC_AUC : 0.605857133223997 Sensitivity : 0.66005291005291 Specificity : 0.5516613563950842 Accuracy : 0.5602472757753563 GMean: 0.6041379037785287
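(evaluate is my own helper and is not shown in full; roughly it computes the metrics above, something like this sketch, which is not my exact code:)

from sklearn.metrics import roc_auc_score, confusion_matrix, accuracy_score
import numpy as np

def evaluate(model, X_test, Y_test):
    # Sketch of a helper reporting ROC AUC, sensitivity, specificity, accuracy and G-mean
    y_pred = model.predict(X_test)
    y_score = model.predict_proba(X_test)[:, 1]
    tn, fp, fn, tp = confusion_matrix(Y_test, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {'ROC_AUC': roc_auc_score(Y_test, y_score),
            'Sensitivity': sensitivity,
            'Specificity': specificity,
            'Accuracy': accuracy_score(Y_test, y_pred),
            'GMean': np.sqrt(sensitivity * specificity)}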
You can see the test ROC AUC is worse than every result from the 5-fold cross-validation. So I have some questions here:
After I get the results of the cross-validation, such as the best parameters, should I re-train the model on the whole training data again with those best parameters? Or is best_estimator_.predict enough here?
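In case it matters, this is what I mean by an explicit refit with the best parameters, versus just using the search object (I believe refit=True is the scikit-learn default, so best_estimator_ should already be retrained on all of X_train; again just a sketch assuming the clf_dt above):

# With refit=True (the default), RandomizedSearchCV retrains the best candidate
# on the whole training set, and best_estimator_ is that refit model.
print(clf_dt.refit)             # True by default
print(clf_dt.best_params_)      # the winning parameter combination

# The manual alternative: refit on the whole training set with the best parameters
manual_dt = tree.DecisionTreeClassifier(**clf_dt.best_params_)
manual_dt.fit(X_train, y_train)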
I also tried other methods such as logistic regression, SVM and others, and the result on the test data is always worse than any result from the cross-validation. I think I just have some overfitting problem here, but the professor said it doesn't make sense that I always get a worse result on the test data than in the cross-validation. I am super frustrated about it.
I hope someone can help me out. My research actually focuses on an optimization model, but I do need some good predictions as an input to it. Thanks!