
Imagine a real-world scenario where you are only allowed to flag between 0 and 5% of the total population. You have to say "Here, I think these 5% of people have trait A," and you aren't allowed to guess more than that. The other thing is, only 3-5% of the people actually have trait A, so it's not necessarily an easy trait to pick up on.

I guess I don't care about the entire AUC, I only care about the AUC between 0.95 and 1.00.

As an aside, most of the modeling I do is in R using caret; if there is any simple setting I could adjust in the metric, that would be much appreciated:

library(caret)

# fiveFoldsClass: a trainControl object set up for 5-fold cross-validation
model <- train(y = y, x = x,
               metric    = "ROC",
               method    = "rpart",
               trControl = fiveFoldsClass)
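
(For reference, fiveFoldsClass is assumed here to be a 5-fold cross-validation control object along these lines; the exact settings are not shown above, so this is only a rough sketch:)

fiveFoldsClass <- trainControl(method          = "cv",
                               number          = 5,
                               classProbs      = TRUE,             # required for metric = "ROC"
                               summaryFunction = twoClassSummary,  # reports ROC, Sens, Spec
                               savePredictions = "final")
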
Factuary
  • AUC is an indicator that describes a model, an analysis, and I think you are trying to link it somehow to a portion of your sample, which could cause confusion. Also, an ROC curve describes not cases (people) from your dataset, but the consequences of certain classification decisions for rates of true and false positives and negatives. I think more detail about your problem, your variables, their distributions, and/or your sample size might help someone develop a helpful answer for you. – rolando2 Jan 21 '17 at 13:34
  • Thank you for your reply, rolando2. Essentially, what is most important to me is my true positive rate, and my probability cutoff can never flag more than 5% of the total data set. So I only care about how my models perform in the 95th to 100th percentiles of the predicted probability, and Sensitivity is the most important factor for me. Does that make more sense? Sorry if I'm still not explaining it well. – Factuary Jan 23 '17 at 18:22

1 Answer


The general topic of binary classification with strongly unbalanced classes has been covered to a certain extent in the thread with the same name. Very briefly: caret does allow for more imbalance-appropriate metrics, like Cohen's kappa or the Precision-Recall AUC; the PR AUC is relatively new and is available through the prSummary summary function. You can also try resampling approaches, where you rebalance the sample during estimation so the minority-class features become more prominent.
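
For illustration, here is a minimal sketch of switching caret to the PR AUC and up-sampling the minority class; the object names ctrl_pr and fit_pr and the rpart learner are just placeholders:

library(caret)
library(MLmetrics)   # prSummary computes the PR AUC via MLmetrics

ctrl_pr <- trainControl(method          = "cv",
                        number          = 5,
                        classProbs      = TRUE,
                        summaryFunction = prSummary,   # reports AUC (PR), Precision, Recall, F
                        sampling        = "up")        # or "down" / "smote" to rebalance

fit_pr <- train(x = x, y = y,
                method    = "rpart",
                metric    = "AUC",                     # the PR AUC, as named by prSummary
                trControl = ctrl_pr)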

Having said the above, you seem to have a particular constraint on the total number of positives $N$ you can predict. I can think of two immediate work-arounds. Both of them rely on the idea that you are using a probabilistic classifier. Simply put, a probabilistic classifier is a classification routine that can output a measure of belief about its prediction in the form of a number in $[0,1]$ that we can interpret as a probability. Elastic nets, Random Forests and various ensemble classifiers usually offer this out of the box. SVMs usually do not provide out-of-the-box probabilities, but you can get them if you are willing to accept some approximations. Anyway, back to the work-arounds:
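
To make this concrete: for a caret model fitted with classProbs = TRUE, the class probabilities can be extracted as below (fit and x_new are placeholder names):

probs <- predict(fit, newdata = x_new, type = "prob")
head(probs)   # one row per observation, one column per class, values in [0, 1]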

  1. Use a custom metric. Instead of evaluating the area below the whole PR curve, we focus on the area that guarantees a minimum number of points. These are generally known as partial AUC metrics. They require us to define a custom performance metric; check caret's trainControl summaryFunction argument for more on this (a sketch is given just after this list). Let me stress that you do not necessarily have to use an AUC. Given that we can estimate probabilities at each step of our model-training procedure, we can apply a thresholding step within the estimation procedure, right before evaluating our performance metric. Notice that in the case where we "fix $N$", using the Recall (Sensitivity) value as a metric would be fine, because it immediately accounts for the fact that we want $N$ points. (Actually, in that case the Recall and the Precision would be equal, as the number of False Negatives would equal the number of False Positives.)

  2. Threshold the final output. Given that one can estimate the probability of an item belonging to a particular class, we can simply pick the items with the $N$ highest probabilities for the class of interest (see the second sketch after this list). This is very easy to implement, as we essentially apply a threshold right before reporting our findings. We can estimate models and evaluate them using our favourite performance metrics without any real changes to our workflow. This is a simplistic approach, but it is the easiest way to satisfy the constraint given. If we use this approach, it will probably be more appropriate to use an AUC-based performance metric during training. That is because using something like Accuracy, Recall, etc. would imply a particular threshold $p$ (usually $0.5$) when calculating the metrics used for model training; we do not want that, as we never calibrate that $p$ under this approach.
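
For option 1, here is a rough sketch of a custom summaryFunction that scores sensitivity when only the top 5% of cases (by predicted event probability) may be flagged; the name topFracRecall and the 5% budget are illustrative assumptions:

topFracRecall <- function(data, lev = NULL, model = NULL) {
  event   <- lev[1]                                  # caret treats the first factor level as the event
  n_flag  <- ceiling(0.05 * nrow(data))              # the 5% budget of positive predictions
  flagged <- order(data[[event]], decreasing = TRUE)[seq_len(n_flag)]
  hits    <- sum(data$obs[flagged] == event)         # true positives among the flagged cases
  c(RecallAtTop5 = hits / sum(data$obs == event))
}

ctrl_top <- trainControl(method = "cv", number = 5,
                         classProbs = TRUE,           # required so 'data' contains probability columns
                         summaryFunction = topFracRecall)

fit_top <- train(x = x, y = y, method = "rpart",
                 metric = "RecallAtTop5", maximize = TRUE,
                 trControl = ctrl_top)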
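
For option 2, here is a sketch of flagging only the $N$ cases with the highest predicted probability from an already-fitted caret model; fit, x_new and the event label "Yes" are assumed names:

probs   <- predict(fit, newdata = x_new, type = "prob")[, "Yes"]   # P(event) for each case
n_flag  <- ceiling(0.05 * length(probs))                           # the 5% budget
flagged <- rank(-probs, ties.method = "first") <= n_flag           # TRUE for the top-N cases
table(flagged)                                                     # at most 5% predicted positive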

A very important caveat: we need a well-calibrated probabilistic classifier to use this approach; i.e. we need good agreement between the predicted class probabilities and the observed class rates (see caret's calibration function on this). Otherwise our insights will be completely off when it comes to discriminating between items. As a final suggestion, I would recommend that you look at lift curves; they show how fast you can find a given number of positive examples. Given the restriction imposed, lift charts will probably be very informative, and you will probably want to present them when reporting your findings.
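
As a sketch of these diagnostics, caret's calibration() and lift() functions can be applied to held-out predictions; the data frame held_out and its columns obs (observed class) and Yes (predicted event probability) are assumed names:

library(caret)     # also attaches lattice, which provides xyplot()

cal_obj <- calibration(obs ~ Yes, data = held_out, class = "Yes")
xyplot(cal_obj)    # predicted probability bins vs. observed event rates

lift_obj <- lift(obs ~ Yes, data = held_out, class = "Yes")
xyplot(lift_obj)   # cumulative % of positives found vs. % of the sample flagged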

usεr11852
  • Thank you for your fantastic reply! It sounds like I was on the right track in terms of model choice; I've been attempting various models like Random Forest, GBM, xgboost and lasso regression. I tried your recommendation to use "Kappa" or "prSummary", which resulted in an ROC improvement, but at the end I'm given "Warning message: In train.default(y = Y, x = X, metric = "Kappa", : The metric "Kappa" was not in the result set. ROC will be used instead." Is there a way to check all the various inputs I can use for the different arguments? Something like Options(train, metric)? – Factuary Jan 26 '17 at 04:22
  • Cool, I am glad I could help. If you believe this answers your question, you could consider accepting the answer. Some side notes: refer to the [caret package website](https://topepo.github.io/caret/) for how to use different metrics. 'Kappa' is one of the standard metrics returned by `defaultSummary` (not `twoClassSummary`); maybe check the `summaryFunction` argument used in `trainControl`. One more thing: you can see in the link [here](stackoverflow.com/questions/22434850/) how to define custom metrics. – usεr11852 Jan 26 '17 at 07:24
  • I think, to start, I'm going with your second option, thresholding the final output. It seems to be giving okay results compared to what we normally operate with (it's very noisy). My one issue is that I can't find the probabilities for the full model anywhere in the model object. I can see all the fold results in model$pred, and it gives probabilities for each class, but where is that for the full model? I've googled like crazy and I'm pretty sure I checked every path down the lists. – Factuary Jan 26 '17 at 07:30
  • I figured out why the error was happening, I had `fiveStats – Factuary Jan 26 '17 at 19:57
  • To get the predicted probabilities for the whole dataset I think you just use `predict(type='prob', ...)`. Off the top of my head I don't remember the syntax to specify reference classes in `caret`, but it is possible. Worst-case scenario, just `relevel` your response vector. Questions specifically about the use of an R package (like `caret`) should be posted on [SO](http://stackoverflow.com/); it already has quite a few `caret` queries, so you should get proper attention. Max Kuhn, the creator and maintainer of `caret`, answers questions there from time to time too! – usεr11852 Jan 26 '17 at 23:48
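
(A small illustration of the relevel suggestion, with class labels "Yes"/"No" assumed: caret treats the first factor level of the outcome as the event of interest, so the positive class should come first.)

y <- factor(y)
y <- relevel(y, ref = "Yes")   # make "Yes" the first level, i.e. the event of interest
levels(y)                      # "Yes" "No"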