
I'm working on a binary classification problem with imbalanced classes (10:1). Since XGBoost's objective for binary classification is 'binary:logistic', I expected the predicted probabilities to be well calibrated. However, I'm getting a very puzzling result:

import xgboost as xgb
from sklearn.model_selection import cross_val_predict

xgb_clf = xgb.XGBClassifier(n_estimators=1000,
                            learning_rate=0.01,
                            max_depth=3,
                            subsample=0.8,
                            colsample_bytree=1,
                            gamma=1,
                            objective='binary:logistic',
                            scale_pos_weight=10)

y_score_xgb = cross_val_predict(estimator=xgb_clf, X=X, y=y,
                                method='predict_proba', cv=5)

# plot_calibration_curves is my own plotting helper (not shown).
plot_calibration_curves(y_true=y, y_prob=y_score_xgb[:, 1], n_bins=10)

[Image: calibration (reliability) curve of the uncalibrated model]

It looks like a "nice" (roughly linear) reliability curve; however, the slope is less than 45 degrees.

Here is the classification report: [Image: classification report]

However, if I do calibration, the resulting curve looks even worse:

from sklearn.calibration import CalibratedClassifierCV

calibrated = CalibratedClassifierCV(xgb_clf, method='sigmoid', cv=5)

y_score_xgb_clb = cross_val_predict(estimator=calibrated, X=X, y=y,
                                    method='predict_proba', cv=5)

plot_calibration_curves(y_true=y, y_prob=y_score_xgb_clb[:, 1], n_bins=10)

[Image: calibration curve after sigmoid (Platt) calibration]

What is stranger is that the output probabilities are now clipped at ~0.75 (I don't get any scores higher than 0.75).

Any suggestions / flaws in my approach?

Arnold Klein
  • There's a good chance your model is poorly calibrated because you set `scale_pos_weight = 10`. Try re-running the model with `scale_pos_weight = 1`. – Zach Sep 27 '19 at 15:28
  • I suspect your learning rate is too low relative to the number of trees. Has the error converged after 1000 trees? – seanv507 Jul 19 '20 at 15:43
  • How could scale_pos_weight be affecting this? If you have class imbalance, isn't this parameter needed? If well-calibrated probabilities are important, I would suggest optimizing the Brier score. – Maths12 Aug 19 '20 at 12:41

2 Answers


I'm not sure that "since the objective function is 'binary:logistic', the probabilities should be well calibrated" is correct: gradient boosting tends to push probabilities toward 0 and 1. Furthermore, you're applying class weights, which should also skew your probabilities.

Because gradient boosting pushes probabilities outward rather than inward, Platt scaling (method='sigmoid') is generally not the best bet: it assumes the miscalibration itself is sigmoid-shaped. On the other hand, your original calibration plot does look vaguely like the leftmost part of a sigmoid, and that explains why your recalibrated scores get cut off at ~0.75: fitting a sigmoid onto your calibration plot (which isn't literally what happens, but is close enough) leaves the right half of the sigmoid unreachable.
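
To see why there's a ceiling, note that Platt scaling maps each raw score s to 1 / (1 + exp(A*s + B)). Since your raw scores never exceed 1, the calibrated output can never exceed the sigmoid's value at s = 1. A toy illustration (the coefficients A and B here are made up for the sake of the example, not what sklearn actually fit):

import numpy as np

def platt(s, A, B):
    # Platt scaling: a sigmoid applied to the raw score s.
    return 1.0 / (1.0 + np.exp(A * s + B))

A, B = -4.0, 2.9          # hypothetical coefficients (A < 0)
print(platt(0.0, A, B))   # ~0.05 at the bottom of the score range
print(platt(1.0, A, B))   # ~0.75: the ceiling you observed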

For expediency, I would first try method='isotonic'. For better understanding, I would suggest shifting the scores to account for the weighting you applied, and seeing where the calibration plot sits then. (The shifting correction is better documented for logistic regression; see "Does down-sampling change logistic regression coefficients?" and "Convert predicted probabilities after downsampling to actual probabilities in classification".)
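
If you want to try that shift directly, here is a rough sketch. It assumes the only distortion comes from scale_pos_weight=10 (which won't be exactly true for boosting) and reuses the arrays from your question:

import numpy as np

def unweight_probs(p_weighted, w=10.0):
    # Undo the (approximate) effect of scale_pos_weight = w: weighting the
    # positives by w roughly multiplies the model's odds by w, so divide
    # the odds by w to shift them back.
    p = np.clip(p_weighted, 1e-15, 1 - 1e-15)  # guard against exact 0 or 1
    odds = p / (1.0 - p) / w
    return odds / (1.0 + odds)

p_shifted = unweight_probs(y_score_xgb[:, 1], w=10.0)
plot_calibration_curves(y_true=y, y_prob=p_shifted, n_bins=10)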

Finally, sklearn's calibration_curve uses equal-width bins by default, which in an imbalanced dataset is probably not best. You might want to use equal-size bins (equal numbers of datapoints per bin) instead, to get a better picture. In particular, I suspect the last two points on your second calibration curve represent very few datapoints and should be taken with a grain of salt. (As of sklearn v0.21, this is easier with the new parameter strategy='quantile'.)
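
With sklearn >= 0.21, that's a one-argument change. A sketch using sklearn's calibration_curve rather than your plotting helper:

from sklearn.calibration import calibration_curve

# Equal-frequency bins: every plotted point now summarizes the same number
# of predictions, so sparse high-score bins can't dominate the picture.
prob_true, prob_pred = calibration_curve(y, y_score_xgb[:, 1],
                                         n_bins=10, strategy='quantile')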

Ben Reiniger
  • It was my understanding that scale_pos_weight was used to weight the gradient calculations, but not for the evaluation, which would make it different from plain oversampling. Any thoughts on that? – lcrmorin Nov 07 '19 at 09:54
  • @lcrmorin, the gradient goes into the leaf scores: see eq. 5 in the paper (https://arxiv.org/pdf/1603.02754.pdf). It might help to think about the case without L2 regularization (lambda = 0) and with squared-error loss, so that h is constant. Then w* is just the **weighted** average of the residuals in the leaf. See also https://stats.stackexchange.com/q/326110/232706 – Ben Reiniger Nov 08 '19 at 22:06
  • I am not understanding: if you have an imbalanced problem, are you saying we should use quantile binning? Why? – Maths12 Aug 19 '20 at 12:36
  • @Maths12, for example in the post-calibration plot in the OP, probably those two rightmost points consist of very few data points, and so are not very reliable. Adding some sort of error bar would be nice, but just ensuring that each plotted point summarizes a sufficient volume of the original data is helpful. – Ben Reiniger Aug 19 '20 at 14:01
  • So scale_pos_weight affects the weight formula (formula 5 from the paper). If I am understanding this correctly, then having a large scale_pos_weight will increase the weighting vector. This weight vector is given by eq. 5 and is negative. Therefore, when it comes to calculating the probabilities, 1/(1+e^(-Xw)), a high weighting vector will decrease the probability, since w is negative? – Maths12 Sep 01 '20 at 16:09
  • "having a large scale pos weight will increase the weightings vector" is not necessarily true. The `scale_pos_weight` affects both the numerator and denominator; in the simple case I set above (lambda=0, loss=square-loss), the class weights are used in a weighted average of the gradients. Note too that w_i in eq5 is not necessarily negative, since the gradients (residual, in our simple case) will often be negative themselves. – Ben Reiniger Sep 01 '20 at 20:15
  • Thanks for this, Ben. Another question: I am dealing with class imbalance of around 0.6%, and yet it is important for me to have well-calibrated probabilities, so I optimize the Brier score. I get a very "crazy" calibration curve, i.e. it zig-zags. I do not use scale_pos_weight, as I now understand what it does to the probabilities and that it results in uncalibrated curves. What else could I do to get better calibration? – Maths12 Sep 17 '20 at 13:42
  • @Maths12, I'd suggest a separate question. (Is it one of your DS posts already?) If your task is sufficiently hard, I'm not sure if there's much to be done; but others here might have good suggestions, and a new question will gather more attention than this comment. – Ben Reiniger Sep 17 '20 at 15:48

I'm not that familiar with gradient boosting, but I would assume that if you up-weight your minority class, your model will not be well calibrated: at the end of the day, it has learned the distribution of the reweighted training data, which does not reflect reality.

As for CalibratedClassifierCV, from reading the docs it seems that the sigmoid method is not applicable here, given that your distortion is not sigmoid-shaped. Hence, if you have enough data that overfitting is not an issue, why not try method='isotonic'?
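
A minimal sketch, reusing the estimator and plotting helper from the question:

from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import cross_val_predict

# Isotonic regression fits a monotone step function, so it can follow
# whatever shape the miscalibration has (at the cost of needing more data).
calibrated_iso = CalibratedClassifierCV(xgb_clf, method='isotonic', cv=5)
y_score_iso = cross_val_predict(estimator=calibrated_iso, X=X, y=y,
                                method='predict_proba', cv=5)
plot_calibration_curves(y_true=y, y_prob=y_score_iso[:, 1], n_bins=10)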

Anonymous
  • Why is it that if you scale your minority class, the model will not be well calibrated? – Maths12 Aug 26 '20 at 15:25
  • A lot of times people do it unnecessarily, especially with deep learning. But if you have a small minority class, it can cause issues with decision trees: you need enough examples of each class at each split for them to work effectively. – Anonymous Aug 27 '20 at 17:14