I am training an XGBoost regression model and tuning its hyperparameters with scikit-learn's RandomizedSearchCV, like this:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV

xg_reg = xgb.XGBRegressor()

# Candidate values for each hyperparameter
learning_rate = np.linspace(start=0.01, stop=1, num=200)
colsample_bytree = np.linspace(start=0.01, stop=1, num=50)
max_depth = [int(x) for x in np.linspace(1, 1000, num=50)]
n_estimators = [int(x) for x in np.linspace(start=1, stop=5000, num=100)]
subsample = np.linspace(start=0.01, stop=1, num=20)

random_grid = {
    "learning_rate": learning_rate,
    "colsample_bytree": colsample_bytree,
    "max_depth": max_depth,
    "n_estimators": n_estimators,
    "subsample": subsample,
}

# 50 random draws from the grid, each scored with 10-fold cross-validation
randomsearch = RandomizedSearchCV(
    xg_reg, param_distributions=random_grid, cv=10, n_iter=50
)
randomsearch.fit(X_train, y_train)
With the best parameters found by the search, the model performs very well on my training data and terribly on my test data, so this looks like an overfitting problem. However, most resources say to use cross-validation to avoid overfitting, and I already did that with cv=10. They also say to evaluate the model on another dataset to check whether it performs worse there, but that only confirms the problem rather than solving it.
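For reference, this is roughly how I measured the gap (a minimal sketch; X_test/y_test are my held-out split, and r2_score just stands in for whatever metric is preferred):

from sklearn.metrics import r2_score

# best_estimator_ is already refit on the full training set
# by RandomizedSearchCV (refit=True is the default)
best_model = randomsearch.best_estimator_

# Compare performance on training vs. held-out test data
print("train R^2:", r2_score(y_train, best_model.predict(X_train)))
print("test R^2:", r2_score(y_test, best_model.predict(X_test)))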
So the question remains: what can I do now that I believe my model is overfitting?