
I am using gbm (via R's caret package, with the train function) on a class-imbalanced data set with weights, so class-1 has a weight of 1 and class-0 has a weight of 10. I am using parameter tuning and minimising AUC. My question is this: if you are using weights in gbm with a class-imbalanced data set, then you are artificially making the classifier put more focus on the minority class, and AUC/ROC is mainly used to check the sensitivity & specificity trade-off. Does it make sense to minimise AUC with weights in GBM, or should it be accuracy? Please excuse my lack of understanding.
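For reference, this is roughly the setup I mean (a sketch only; the data frame, column, and class-level names are placeholders):

```r
library(caret)

# df: data frame with predictors and a factor outcome y with levels "class0"/"class1"
# class-0 is the minority class here, so it gets the larger case weight
w <- ifelse(df$y == "class0", 10, 1)

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

fit <- train(y ~ ., data = df,
             method = "gbm",
             metric = "ROC",      # twoClassSummary reports the ROC AUC
             weights = w,         # case weights passed through to gbm
             trControl = ctrl,
             verbose = FALSE)
```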

Thanks.

syebill
  • AUC makes sense for any model, regardless of how it's built, as long as the outcome is categorical. It's just a criterion for the trade-off between false positives and true positives. – meh Sep 19 '15 at 16:57
  • AUC is a poor choice of metric for class-imbalanced data. Precision-recall works better than sensitivity-specificity. I'm not sure how the weighting will affect this, though; probably just in how you weight the importance of precision vs recall for your use case. – Dan Jul 31 '18 at 15:24
  • @Dan Sensitivity to class imbalance is an important distinction between PR and ROC curves, but I think that your explanation would greatly benefit from explaining why sensitivity to class imbalance is important. In my view, class imbalance is mostly an accident of data collection and data availability, so I prefer a measurement which isn't sensitive to that. – Sycorax Jul 31 '18 at 15:45
  • @Sycorax heavy class imbalance is surely most often caused by the underlying distribution of what is being measured, not noise to be ignored. Consider a dataset of transactions in which we want to classify fraudulent transactions. It may be that 99.999% of transactions are not fraudulent. What does that have to do with data collection or availability? It's simply a property of the problem being solved. – Dan Jul 31 '18 at 16:01
  • You're probably right about fraud modeling. I don't know, I've never worked on that problem. But do you believe it's possible that there exists a problem where class imbalance *is* accidental? If so, that's a reason to prefer ROC analysis. That's all I'm saying -- use the tool that is appropriate to your task. – Sycorax Jul 31 '18 at 16:12
  • @Sycorax if the classes are imbalanced at, say, 60-40, I wouldn't even consider that imbalanced. If it's around 90-10, then I don't think it's an accident. When I read "imbalanced classes", to me it means closer to 90-10 or worse. – Dan Jul 31 '18 at 16:16

1 Answer


You can use ROC AUC to measure the performance of any binary classifier. There are also extensions of ROC analysis to multi-class classification.

But note that you want to maximize AUC: higher is better.
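For example, a quick sketch of computing it with the pROC package (y and p are placeholder vectors of observed classes and predicted probabilities of the positive class):

```r
library(pROC)

# y: factor of observed classes; p: predicted probability of "class1"
roc_obj <- roc(response = y, predictor = p,
               levels = c("class0", "class1"), direction = "<")
auc(roc_obj)  # ranges over [0, 1]; 0.5 is chance level, 1 is a perfect ranking
```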

Sycorax
  • ROC AUC is not great for class imbalanced data: https://www.kaggle.com/general/7517 – Dan Jul 31 '18 at 15:27
  • You've misread the post. "A large change in the **number of false positives** can lead to a small change in the false positive rate used in ROC analysis." Class imbalance doesn't influence FPR. – Sycorax Jul 31 '18 at 15:30
  • That's from the question... rather take a look at the answer, or the paper. It affects how pronounced the difference between the numbers will be when comparing two classifiers. With heavily imbalanced classes, the ROC AUC can be extremely similar for two classifiers with very different PR AUC (a minimal sketch for computing both metrics follows these comments). – Dan Jul 31 '18 at 15:33
  • @Dan It seems like you're saying the same thing with different implicit assumptions. I'm saying that a ROC curve is not sensitive to imbalanced classes, and I think that's a good thing. You seem to be saying that you would prefer to use a metric that is sensitive to class imbalance. – Sycorax Jul 31 '18 at 15:38
  • All I'm saying is that for this problem, with imbalanced classes, there are many articles out there suggesting you should rather use PR AUC: https://stats.stackexchange.com/a/90783/40604 or that paper from the link http://pages.cs.wisc.edu/~jdavis/davisgoadrichcamera2.pdf – Dan Jul 31 '18 at 15:57
  • Sure, I've read that post and that article. What I'm trying to say is that PR curves **solve a different problem** and statements which make unqualified recommendations for one method over another are almost always concealing implicit assumptions. It's rare in statistics that one procedure is strictly better than another procedure -- usually there's some difference in underlying assumptions or goals. Understanding the differences in assumptions and goals is a core aspect of using statistics to solve real problems. – Sycorax Jul 31 '18 at 16:09
  • You said AUC ROC is fine for any classifier. I'm saying that there are circumstances when it might not be the best option. Not sure why my statement is 'unqualified' but yours isn't, but OK. – Dan Jul 31 '18 at 16:37
  • It seems like you could write a good answer explaining (1) what PR curves are, (2) what problems they solve, and (3) how they solve those problems in a way that ROC curves do not. – Sycorax Jul 31 '18 at 17:16
  • I think the links above are sufficient – Dan Jul 31 '18 at 17:18
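For anyone who wants to compare the two metrics on their own predictions, a minimal sketch with the PRROC package (the scores and labels vectors are placeholders, not taken from the question):

```r
library(PRROC)

# scores: predicted probability of the positive class; labels: 1 = positive, 0 = negative
fg <- scores[labels == 1]   # scores for the (rare) positive class
bg <- scores[labels == 0]   # scores for the negative class

roc.curve(scores.class0 = fg, scores.class1 = bg)$auc          # ROC AUC
pr.curve(scores.class0 = fg, scores.class1 = bg)$auc.integral  # PR AUC
```

Under heavy imbalance the two numbers can rank classifiers differently, which is the point of contention in the comments above.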