I am training an XGBoost regression model and tuning its hyperparameters with scikit-learn's RandomizedSearchCV, like this:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

xg_reg = xgb.XGBRegressor()

# Candidate values for each hyperparameter
learning_rate = np.linspace(start=0.01, stop=1, num=200)
colsample_bytree = np.linspace(start=0.01, stop=1, num=50)
max_depth = [int(x) for x in np.linspace(1, 1000, num=50)]
n_estimators = [int(x) for x in np.linspace(start=1, stop=5000, num=100)]
subsample = np.linspace(start=0.01, stop=1, num=20)

random_grid = {
    "learning_rate": learning_rate,
    "colsample_bytree": colsample_bytree,
    "max_depth": max_depth,
    "n_estimators": n_estimators,
    "subsample": subsample,
}

# 50 random draws from the grid, each scored with 10-fold cross-validation
randomsearch = RandomizedSearchCV(
    xg_reg, param_distributions=random_grid, cv=10, n_iter=50
)
randomsearch.fit(X_train, y_train)
With the best parameters found by the search, the model performs very well on my training data and terribly on my test data, so this looks like an overfitting problem. However, most resources say to use cross-validation to avoid overfitting, and I already did that with cv=10. They also say to evaluate the model on another dataset to check whether it performs worse there, but that only confirms the problem rather than solving it.
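For reference, this is roughly how I measured the gap (a minimal sketch; X_test/y_test are my held-out split, and r2_score just stands in for whatever metric is preferred):

from sklearn.metrics import r2_score

# best_estimator_ is already refit on the full training set
# by RandomizedSearchCV (refit=True is the default)
best_model = randomsearch.best_estimator_

# Compare performance on training vs. held-out test data
print("train R^2:", r2_score(y_train, best_model.predict(X_train)))
print("test R^2:", r2_score(y_test, best_model.predict(X_test)))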
So the question remains: what can I do now that I believe my model is overfitting?