
I am dealing with a classification problem on an unbalanced dataset (positive class is just above 1% of the sample).

I did hyperparameter tuning using a train-validation split, and then finally trained the model and checked the metrics of interest on an unseen test set.
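For concreteness, my split looks roughly like this sketch (synthetic data stands in for my actual dataset, and the exact sizes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 10_000
X = rng.normal(size=(n, 5))
y = (rng.random(n) < 0.01).astype(int)  # ~1% positive class, like my data

# Hold out the final test set first, then split the remainder
# into train/validation for hyperparameter tuning.
# stratify=y keeps the ~1% positive rate in every partition.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=0)
```

The test set is only touched once, after tuning is done.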

While these metrics were acceptable, when I checked the model on the training set I realized it had completely overfit (the area under the precision-recall curve is 1).
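What I mean by "checking on the training set" is the following kind of comparison (using a stand-in unrestricted decision tree rather than my actual XGBoost configuration, and labels that are independent of the features, just to illustrate the symptom):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(5_000, 5))
y = (rng.random(5_000) < 0.01).astype(int)  # ~1% positives, no real signal
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# An unrestricted tree memorizes the training set completely.
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
ap_train = average_precision_score(y_tr, clf.predict_proba(X_tr)[:, 1])
ap_test = average_precision_score(y_te, clf.predict_proba(X_te)[:, 1])
print(f"train PR-AUC: {ap_train:.3f}")  # 1.000: memorized the training set
print(f"test  PR-AUC: {ap_test:.3f}")   # collapses toward the 1% base rate
```

A train PR-AUC of exactly 1 with a much lower test PR-AUC is the pattern I observed.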

  1. Is there general advice on the range to tune over for each hyperparameter, depending on the size of the data (rows and columns)? E.g. "the number of estimators should be around sqrt(num_rows)", or some rule of thumb of that sort.
  2. Since my dataset was so unbalanced, I tuned scale_pos_weight over a range of 80 to 100. I also tuned min_child_weight, which ended up being 7. Does this mean that (putting max_depth aside for the time being) the tree was able to keep splitting until each positive observation was alone in a leaf node? Or does "weight" have two different meanings in the two hyperparameters' names?
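My back-of-the-envelope arithmetic for why I suspect a lone positive could satisfy min_child_weight, assuming that for logistic loss the instance hessian is p * (1 - p), that min_child_weight thresholds the sum of hessians in a child, and that scale_pos_weight multiplies a positive's hessian (please correct me if any of these assumptions is wrong):

```python
# Hessian of the logistic loss at predicted probability p;
# it is largest (0.25) at p = 0.5 and shrinks as p nears 0 or 1.
def hessian(p):
    return p * (1.0 - p)

scale_pos_weight = 90.0  # middle of my 80-100 tuning range
min_child_weight = 7.0   # the tuned value

# Weighted hessian contribution of a SINGLE positive example to a leaf,
# at various predicted probabilities.
for p in (0.5, 0.9, 0.99):
    h = scale_pos_weight * hessian(p)
    print(f"p={p}: weighted hessian {h:.3f}, "
          f"passes min_child_weight: {h >= min_child_weight}")
```

If this is right, then early in training (p near 0.5) a single positive already contributes far more than 7, so min_child_weight alone would not stop a positive from being isolated in its own leaf.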
Niccolo'
