
I'm trying to estimate the probability of a rare occurrence. My training data has a binary 1/0 output and a TF-IDF vector of words as input.

This seems to lend itself to regression, so I've been using xgboost's XGBRegressor with decent results, but something is confusing me.

The feature_importances_ for XGBRegressor pick out n-grams that don't seem that important to me.

When I ran the exact same test with XGBClassifier instead, the important phrases it chose made more sense to me.

sklearn's roc_auc_score tells a similar story: it ranks the classifier as better than the regressor, which I don't understand.

Can I compare the roc_auc_score of a classifier and a regressor apples to apples?

Does it make sense that a classifier could perform better at predicting a probability than a regressor?

1 Answer


This really should be done as a classification problem; xgb naturally produces a probability score (though generally not well-calibrated).

I would suspect that XGBRegressor would actually produce some outputs outside [0,1]; is that the case for you?
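A minimal sketch of how you could check this (synthetic data, default hyperparameters, and the 95/5 class split are assumptions here, not your actual setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor, XGBClassifier

# Synthetic, imbalanced binary data standing in for the TF-IDF features.
X, y = make_classification(n_samples=2000, n_features=50, weights=[0.95], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Regression on a 0/1 target: raw predictions are unbounded and may leave [0, 1].
reg = XGBRegressor(objective="reg:squarederror").fit(X_train, y_train)
reg_pred = reg.predict(X_test)
print("regressor prediction range:", reg_pred.min(), reg_pred.max())

# Classification: predict_proba is squashed through a sigmoid, so it stays in (0, 1).
clf = XGBClassifier(objective="binary:logistic").fit(X_train, y_train)
clf_pred = clf.predict_proba(X_test)[:, 1]
print("classifier probability range:", clf_pred.min(), clf_pred.max())
```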

Gradient boosting is always a regression under the hood; the only difference is in the loss function. Using MSE here will, for example, give the same penalty for predicting 1.2 as for predicting 0.8 for a positive sample. That's likely to mess with the rank-ordering, and hence the ROC curve.
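And since AUC depends only on how the scores rank the samples, you can feed either model's scores to roc_auc_score and compare the numbers directly. Continuing the sketch above:

```python
from sklearn.metrics import roc_auc_score

# Both calls are valid: AUC only uses the rank-ordering of the scores, so the
# regressor's raw outputs and the classifier's probabilities are comparable.
print("regressor AUC :", roc_auc_score(y_test, reg_pred))
print("classifier AUC:", roc_auc_score(y_test, clf_pred))

# Any strictly increasing transform of the scores leaves the AUC unchanged,
# which is why the comparison is apples to apples.
print("shifted/scaled AUC:", roc_auc_score(y_test, 3 * reg_pred - 7))

# The squared-error symmetry mentioned above: for a positive sample (y = 1),
# overshooting to 1.2 costs exactly as much as undershooting to 0.8.
print((1.2 - 1) ** 2 == (0.8 - 1) ** 2)  # True, both 0.04
```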

Ben Reiniger
  • Can the results of `roc_auc` be taken as apples to apples across regression and classification? I'm leaning towards making the switch; it's just that I started out confident it should be treated as a regression task, because everyone online talks about rare-probability events needing to be evaluated as a regression task. I guess if XGB is all regression under the hood it makes sense. – Scott Thompson Jan 27 '20 at 23:06
  • I would say yes, because the ROC curve depends only on the rank-ordering. Can you provide reference(s) for rare-event classification being better treated as regression? – Ben Reiniger Jan 27 '20 at 23:41
  • https://stats.stackexchange.com/questions/290886/how-to-improve-rare-event-binary-classification-performance There were a few other posts also. Thanks, though I think I'm probably going to move forward with classification, as by all my measurements it's doing better... – Scott Thompson Jan 28 '20 at 02:19
  • 1
    Ah, I think Harrell's distinction between "classification" and "prediction" there is not quite in line with the sklearn API. Here, most classifiers are probabilistic; since Harrell mostly is arguing against hard classification, I think we're fine. – Ben Reiniger Jan 28 '20 at 04:13