
The title essentially says it all. Below are some details regarding my data and model.

This is the current class distribution within my training set:

0    1353849
1      26217
Name: binary, dtype: int64

My training set includes 104 features.

My current recall is 94% and my current precision is 20%.

Here are the hyperparameters for my XGBoost model:

nrounds = 500, eta = 0.2, max_depth = 20, subsample = 0.8, colsample_bytree = 0.2, reg_alpha = 0.1, reg_lambda = 0.8

I've tried SMOTE, but it isn't working well, likely because of the high dimensionality. If you have any recommendations, they would be much appreciated.
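A commonly suggested alternative to resampling with SMOTE is to reweight the minority class directly via XGBoost's `scale_pos_weight` parameter. The sketch below is only illustrative: it computes the usual starting value (negative count / positive count) from the class distribution above and shows where it would slot into the existing hyperparameters; the `params` dict mirrors the settings in the question and is not tuned advice.

```python
# Sketch: reweight the positive class instead of oversampling with SMOTE.
# Class counts are taken from the training-set distribution in the question.
neg, pos = 1_353_849, 26_217
scale_pos_weight = neg / pos  # common heuristic starting point (~51.6)

# Params as they would be passed to xgboost.train (not executed here);
# values mirror the question's hyperparameters, plus the new weight.
params = {
    "objective": "binary:logistic",
    "eta": 0.2,
    "max_depth": 20,
    "subsample": 0.8,
    "colsample_bytree": 0.2,
    "reg_alpha": 0.1,
    "reg_lambda": 0.8,
    "scale_pos_weight": scale_pos_weight,
}
```

Note that increasing `scale_pos_weight` typically trades precision for recall, so it interacts directly with the threshold choice discussed in the comments.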

  • Given that only about 2% of your sample is 1, a recall of 94% and precision of 20% don't seem bad to me, but of course that depends on your domain. Otherwise, I'd reduce my max_depth to maybe 5, my subsample to maybe 0.5, my eta to maybe 0.02... – jbowman Feb 27 '18 at 19:04
  • Is there a way to extract probabilities instead of classifications? Then set a threshold manually. – ChootsMagoots Feb 27 '18 at 19:19
  • This question is effectively unanswerable, as it requires in-depth knowledge of your data. The following are suggestions: 1) Are you looking at a single point of precision/recall? It is in fact a curve, depending on the threshold you choose. 2) Have you tried hyperparameter tuning?: https://stats.stackexchange.com/questions/171043/how-to-tune-hyperparameters-of-xgboost-trees – Alex R. Feb 27 '18 at 19:40
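The comments above suggest working with predicted probabilities and sweeping the classification threshold rather than accepting the default 0.5 cutoff. A minimal self-contained sketch of that idea follows; the `y_true`/`y_prob` values are made up for illustration (in practice `y_prob` would come from the model's probability output, e.g. XGBoost's `binary:logistic` predictions).

```python
# Sketch: compute precision/recall at several probability thresholds
# to pick the operating point that suits the domain.
# Toy labels and predicted probabilities (made up for illustration):
y_true = [0, 0, 0, 1, 0, 1, 0, 1, 0, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.55, 0.9, 0.05, 0.3]

def precision_recall(y_true, y_prob, threshold):
    """Precision and recall when predicting 1 iff prob >= threshold."""
    tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= threshold)
    fp = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p >= threshold)
    fn = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, y_prob, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is the curve Alex R.'s comment refers to; scikit-learn's `precision_recall_curve` computes the full curve in one call if that library is available.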
